Preface


Through exposure to the news and social media, you are probably aware of the fact 
that machine learning has become one of the most exciting technologies of our time 
and age. Large companies, such as Google, Facebook, Apple, Amazon, and IBM, 
heavily invest in machine learning research and applications for good reasons. While 
it may seem that machine learning has become the buzzword of our time and age, it 
is certainly not a fad. This exciting field opens the way to new possibilities and has 
become indispensable to our daily lives. This is evident in talking to the voice
assistant on our smartphones, recommending the right product for our customers,
preventing credit card fraud, filtering out spam from our email inboxes, and detecting
and diagnosing medical diseases; the list goes on and on.


If you want to become a machine learning practitioner, a better problem solver, or 
maybe even consider a career in machine learning research, then this book is for you. 
However, for a novice, the theoretical concepts behind machine learning can be quite 
overwhelming. Many practical books have been published in recent years that will 
help you get started in machine learning by implementing powerful learning 
algorithms. 


Getting exposed to practical code examples and working through example 
applications of machine learning are a great way to dive into this field. Concrete 
examples help illustrate the broader concepts by putting the learned material directly 
into action. However, remember that with great power comes great responsibility! In 
addition to offering a hands-on experience with machine learning using the Python 
programming language and Python-based machine learning libraries, this book
introduces the mathematical concepts behind machine learning algorithms, which is
essential for using machine learning successfully. Thus, this book is different from a
purely practical book; it is a book that discusses the necessary details regarding 
machine learning concepts and offers intuitive yet informative explanations of how 
machine learning algorithms work, how to use them, and most importantly, how to 
avoid the most common pitfalls. 


Currently, if you type "machine learning" as a search term in Google Scholar, it 
returns an overwhelmingly large number of publications—1,800,000. Of course, we 
cannot discuss the nitty-gritty of all the different algorithms and applications that 
have emerged in the last 60 years. However, in this book, we will embark on an
exciting journey that covers all the essential topics and concepts to give you a head
start in this field. If you find that your thirst for knowledge is not satisfied, this book
references many useful resources that can be used to follow up on the essential 
breakthroughs in this field. 


If you have already studied machine learning theory in detail, this book will show 
you how to put your knowledge into practice. If you have used machine learning 
techniques before and want to gain more insight into how machine learning actually 
works, this book is for you. Don't worry if you are completely new to the machine 
learning field; you have even more reason to be excited. Here is a promise that 
machine learning will change the way you think about the problems you want to 
solve and will show you how to tackle them by unlocking the power of data. 


Before we dive deeper into the machine learning field, let's answer your most 
important question, "Why Python?" The answer is simple: it is powerful yet very 
accessible. Python has become the most popular programming language for data 
science because it allows us to forget about the tedious parts of programming and 
offers us an environment where we can quickly jot down our ideas and put concepts 
directly into action. 


We, the authors, can truly say that the study of machine learning has made us better 
scientists, thinkers, and problem solvers. In this book, we want to share this 
knowledge with you. Knowledge is gained by learning. The key is our enthusiasm,
and the real mastery of skills can only be achieved by practice. The road ahead may 
be bumpy on occasions and some topics may be more challenging than others, but 
we hope that you will embrace this opportunity and focus on the reward. Remember 
that we are on this journey together, and throughout this book, we will add many 
powerful techniques to your arsenal that will help us solve even the toughest 
problems the data-driven way. 

What this book covers


Chapter 1, Giving Computers the Ability to Learn from Data, introduces you to the
main subareas of machine learning in order to tackle various problem tasks. In 
addition, it discusses the essential steps for creating a typical machine learning 
model by building a pipeline that will guide us through the following chapters. 


Chapter 2, Training Simple Machine Learning Algorithms for Classification, goes
back to the origins of machine learning and introduces binary perceptron classifiers 
and adaptive linear neurons. This chapter is a gentle introduction to the fundamentals 
of pattern classification and focuses on the interplay of optimization algorithms and 
machine learning. 


Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, describes the
essential machine learning algorithms for classification and provides practical 
examples using one of the most popular and comprehensive open source machine 
learning libraries: scikit-learn. 


Chapter 4, Building Good Training Sets – Data Preprocessing, discusses how to
deal with the most common problems in unprocessed datasets, such as missing data.
It also discusses several approaches to identify the most informative features in 
datasets and teaches you how to prepare variables of different types as proper input 
for machine learning algorithms. 


Chapter 5, Compressing Data via Dimensionality Reduction, describes the essential
techniques to reduce the number of features in a dataset to smaller sets while 
retaining most of their useful and discriminatory information. It discusses the 
standard approach to dimensionality reduction via principal component analysis and 
compares it to supervised and nonlinear transformation techniques. 


Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter
Tuning, discusses the dos and don'ts for estimating the performances of predictive 
models. Moreover, it discusses different metrics for measuring the performance of 
our models and techniques to fine-tune machine learning algorithms. 


Chapter 7, Combining Different Models for Ensemble Learning, introduces you to
the different concepts of combining multiple learning algorithms effectively. It
teaches you how to build ensembles of experts to overcome the weaknesses of
individual learners, resulting in more accurate and reliable predictions.


Chapter 8, Applying Machine Learning to Sentiment Analysis, discusses the
essential steps to transform textual data into meaningful representations for machine 
learning algorithms to predict the opinions of people based on their writing. 


Chapter 9, Embedding a Machine Learning Model into a Web Application,
continues with the predictive model from the previous chapter and walks you 
through the essential steps of developing web applications with embedded machine 
learning models. 


Chapter 10, Predicting Continuous Target Variables with Regression Analysis,
discusses the essential techniques for modeling linear relationships between
explanatory and response variables to make predictions on a continuous scale. After introducing
different linear models, it also talks about polynomial regression and tree-based 
approaches. 


Chapter 11, Working with Unlabeled Data – Clustering Analysis, shifts the focus to
a different subarea of machine learning, unsupervised learning. We apply algorithms 
from three fundamental families of clustering algorithms to find groups of objects 
that share a certain degree of similarity. 


Chapter 12, Implementing a Multilayer Artificial Neural Network from Scratch,
extends the concept of gradient-based optimization, which we first introduced in 
Chapter 2, Training Simple Machine Learning Algorithms for Classification, to 
build powerful, multilayer neural networks based on the popular backpropagation 
algorithm in Python. 


Chapter 13, Parallelizing Neural Network Training with TensorFlow, builds upon
the knowledge from the previous chapter to provide you with a practical guide for
training neural networks more efficiently. The focus of this chapter is on
TensorFlow, an open source Python library that allows us to utilize multiple cores of 
modern GPUs. 


Chapter 14, Going Deeper – The Mechanics of TensorFlow, covers TensorFlow in
greater detail, explaining its core concepts of computational graphs and sessions. In
addition, this chapter covers topics such as saving and visualizing neural network
graphs, which will come in very handy during the remaining chapters of this book.


Chapter 15, Classifying Images with Deep Convolutional Neural Networks,
discusses deep neural network architectures that have become the new standard in
computer vision and image recognition fields: convolutional neural networks. This
chapter will discuss the main concepts behind convolutional layers as a feature
extractor and apply convolutional neural network architectures to an image
classification task to achieve almost perfect classification accuracy.


Chapter 16, Modeling Sequential Data Using Recurrent Neural Networks,
introduces another popular neural network architecture for deep learning that is 
especially well suited for working with sequential data and time series data. In this 
chapter, we will apply different recurrent neural network architectures to text data. 
We will start with a sentiment analysis task as a warm-up exercise and will learn 
how to generate entirely new text. 

What you need for this book


The execution of the code examples provided in this book requires an installation of 
Python 3.6.0 or newer on macOS, Linux, or Microsoft Windows. We will make 
frequent use of Python's essential libraries for scientific computing throughout this 
book, including SciPy, NumPy, scikit-learn, Matplotlib, and pandas. 
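
As a quick sanity check of these requirements, the following minimal sketch reports whether the running interpreter is Python 3.6 or newer and which of the core libraries cannot be found. The helper name `check_environment` is made up for this illustration; the list contains the usual import names of the libraries (scikit-learn is imported as `sklearn`). It only looks the packages up and does not import or install anything.

```python
import importlib.util
import sys

# Import names of the core libraries listed above
# (scikit-learn is imported under the name "sklearn").
CORE_LIBRARIES = ["scipy", "numpy", "sklearn", "matplotlib", "pandas"]


def check_environment(min_version=(3, 6)):
    """Return (python_ok, missing); missing lists libraries that were not found."""
    python_ok = sys.version_info >= min_version
    missing = [name for name in CORE_LIBRARIES
               if importlib.util.find_spec(name) is None]
    return python_ok, missing


python_ok, missing = check_environment()
print("Python 3.6 or newer:", python_ok)
print("Missing libraries:", ", ".join(missing) if missing else "none")
```

Any library reported as missing can be installed by following the setup instructions in the first chapter.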


The first chapter will provide you with instructions and useful tips to set up your 
Python environment and these core libraries. We will add further libraries to our
repertoire along the way, with installation instructions provided in the respective chapters:
the NLTK library for natural language processing (Chapter 8, Applying Machine
Learning to Sentiment Analysis), the Flask web framework (Chapter 9, Embedding a
Machine Learning Model into a Web Application), the Seaborn library for
statistical data visualization (Chapter 10, Predicting Continuous Target Variables 
with Regression Analysis), and TensorFlow for efficient neural network training on 
graphical processing units (Chapters 13 to 16). 

Who this book is for


If you want to find out how to use Python to start answering critical questions about
your data, pick up Python Machine Learning, Second Edition. Whether you want to
start from scratch or extend your data science knowledge, this is an essential and
unmissable resource.

Conventions


In this book, you will find a number of text styles that distinguish between different 
kinds of information. Here are some examples of these styles and an explanation of 
their meaning. 


Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"Using the out_file=None setting, we directly assigned the dot data to a dot_data
variable, instead of writing an intermediate tree.dot file to disk."


A block of code is set as follows:


>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=5, p=2,
...                            metric='minkowski')
>>> knn.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std, y_combined,
...                       classifier=knn, test_idx=range(105,150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.show()


Any command-line input or output is written as follows: 
pip3 install graphviz 


New terms and important words are shown in bold. Words that you see on the 
screen, for example, in menus or dialog boxes, appear in the text like this: "After we 
click on the Dashboard button in the top-right corner, we have access to the control 
panel shown at the top of the page." 


Note


Warnings or important notes appear in a box like this.


Tip


Tips and tricks appear like this.

Reader feedback


Feedback from our readers is always welcome. Let us know what you think about 
this book, what you liked or disliked. Reader feedback is important for us as it
helps us develop titles that you will really get the most out of. 


To send us general feedback, simply email <feedback@packtpub.com>, and mention 
the book's title in the subject of your message. 


If there is a topic that you have expertise in and you are interested in either writing or 
contributing to a book, see our author guide at www.packtpub.com/authors. 

Customer support


Now that you are the proud owner of a Packt book, we have a number of things to 
help you to get the most from your purchase. 

Downloading the example code


You can download the example code files for this book from your account at 
http://www.packtpub.com. If you purchased this book elsewhere, you can visit 
http://www.packtpub.com/support and register to have the files emailed directly to 
you. 


You can download the code files by following these steps: 


1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.


You can also download the code files by clicking on the Code Files button on the 
book's web page at the Packt Publishing website. This page can be accessed by 
entering the book's name in the Search box. Please note that you need to be logged 
in to your Packt account. 


Once the file is downloaded, please make sure that you unzip or extract the folder 
using the latest version of: 


• WinRAR / 7-Zip for Windows
• Zipeg / iZip / UnRarX for Mac
• 7-Zip / PeaZip for Linux


The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Python-Machine-Learning-Second-Edition. We
also have other code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book


We also provide you with a PDF file that has color images of the 
screenshots/diagrams used in this book. The color images will help you better 
understand the changes in the output. You can download this file from 
http://www.packtpub.com/sites/default/files/downloads/PythonMachineLearningSecc 
In addition, lower resolution color images are embedded in the code notebooks of 
this book that come bundled with the example code files. 

Errata


Although we have taken every care to ensure the accuracy of our content, mistakes 
do happen. If you find a mistake in one of our books—maybe a mistake in the text or 
the code—we would be grateful if you could report this to us. By doing so, you can 
save other readers from frustration and help us improve subsequent versions of this 
book. If you find any errata, please report them by visiting 
http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata 
Submission Form link, and entering the details of your errata. Once your errata are 
verified, your submission will be accepted and the errata will be uploaded to our 
website or added to any list of existing errata under the Errata section of that title. 


To view the previously submitted errata, go to 
https://www.packtpub.com/books/content/support and enter the name of the book in 
the search field. The required information will appear under the Errata section. 

Piracy


Piracy of copyrighted material on the Internet is an ongoing problem across all
media. At Packt, we take the protection of our copyright and licenses very seriously. 
If you come across any illegal copies of our works in any form on the Internet, 
please provide us with the location address or website name immediately so that we 
can pursue a remedy. 


Please contact us at <copyright@packtpub.com> with a link to the suspected pirated 
material. 


We appreciate your help in protecting our authors and our ability to bring you 
valuable content. 

Questions


If you have a problem with any aspect of this book, you can contact us at 
<questions@packtpub.com>, and we will do our best to address the problem. 

Chapter 1. Giving Computers the Ability to Learn from Data


In my opinion, machine learning, the application and science of algorithms that 
make sense of data, is the most exciting field of all the computer sciences! We are
living in an age where data comes in abundance; using self-learning algorithms from
the field of machine learning, we can turn this data into knowledge. Thanks to the
many powerful open source libraries that have been developed in recent years, there
has probably never been a better time to break into the machine learning field and 
learn how to utilize powerful algorithms to spot patterns in data and make 
predictions about future events. 


In this chapter, you will learn about the main concepts and different types of 
machine learning. Together with a basic introduction to the relevant terminology, we 
will lay the groundwork for successfully using machine learning techniques for 
practical problem solving. 


In this chapter, we will cover the following topics: 


• The general concepts of machine learning
• The three types of learning and basic terminology
• The building blocks for successfully designing machine learning systems
• Installing and setting up Python for data analysis and machine learning

Building intelligent machines to transform data into knowledge


In this age of modern technology, there is one resource that we have in abundance: a 
large amount of structured and unstructured data. In the second half of the twentieth 
century, machine learning evolved as a subfield of Artificial Intelligence (AI) that 
involved self-learning algorithms that derived knowledge from data in order to make 
predictions. Instead of requiring humans to manually derive rules and build models 
from analyzing large amounts of data, machine learning offers a more efficient 
alternative for capturing the knowledge in data to gradually improve the performance 
of predictive models and make data-driven decisions. Not only is machine learning
becoming increasingly important in computer science research, but it also plays an 
ever greater role in our everyday lives. Thanks to machine learning, we enjoy robust 
email spam filters, convenient text and voice recognition software, reliable web 
search engines, challenging chess-playing programs, and, hopefully soon, safe and 
efficient self-driving cars. 

The three different types of machine learning


In this section, we will take a look at the three types of machine learning: supervised 
learning, unsupervised learning, and reinforcement learning. We will learn about 
the fundamental differences between the three different learning types and, using 
conceptual examples, we will develop an intuition for the practical problem domains 
where these can be applied:


[Figure: the three types of machine learning.
Supervised learning: labeled data, direct feedback, predict outcome/future.
Unsupervised learning: no labels, no feedback, find hidden structure in data.
Reinforcement learning: decision process, reward system, learn series of actions.]







Making predictions about the future with supervised learning


The main goal in supervised learning is to learn a model from labeled training data
that allows us to make predictions about unseen or future data. Here, the term
supervised refers to a set of samples where the desired output signals (labels) are
already known.


[Figure: supervised learning workflow. Labels and training data are fed to a machine
learning algorithm, which produces a predictive model; the model maps new data to a
prediction.]


Considering the example of email spam filtering, we can train a model using a 
supervised machine learning algorithm on a corpus of labeled emails, that is, emails
that are correctly marked as spam or not-spam, to predict whether a new email belongs to
either of the two categories. A supervised learning task with discrete class labels,
such as in the previous email spam filtering example, is also called a classification
task. Another subcategory of supervised learning is regression, where the outcome
signal is a continuous value.


Classification for predicting class labels


Classification is a subcategory of supervised learning where the goal is to predict the
categorical class labels of new instances, based on past observations. Those class 
labels are discrete, unordered values that can be understood as the group 
memberships of the instances. The previously mentioned example of email spam 
detection represents a typical example of a binary classification task, where the 
machine learning algorithm learns a set of rules in order to distinguish between two 
possible classes: spam and non-spam emails. 


However, the set of class labels does not have to be of a binary nature. The 
predictive model learned by a supervised learning algorithm can assign any class 
label that was presented in the training dataset to a new, unlabeled instance. A 
typical example of a multiclass classification task is handwritten character
recognition. Here, we could collect a training dataset that consists of multiple 
handwritten examples of each letter in the alphabet. Now, if a user provides a new 
handwritten character via an input device, our predictive model will be able to 
predict the correct letter in the alphabet with certain accuracy. However, our machine 
learning system would be unable to correctly recognize any of the digits zero to nine, 
for example, if they were not part of our training dataset. 
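
The point that a model can only assign labels it has seen during training can be illustrated with a minimal sketch. The nearest-neighbor rule used here is just one simple choice made for this illustration, and the two numeric features per character, the training examples, and the function names are all made up.

```python
import math

# Hypothetical training set: two made-up numeric features per handwritten
# character, together with the letter that each example represents.
training_data = [
    ((1.0, 1.2), "a"),
    ((0.9, 1.0), "a"),
    ((3.0, 3.1), "b"),
    ((3.2, 2.9), "b"),
    ((5.0, 0.8), "c"),
    ((5.1, 1.1), "c"),
]

# The only labels the model can ever output.
known_labels = {label for _, label in training_data}


def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def predict(features):
    """Return the label of the closest training example (1-nearest neighbor)."""
    _, nearest_label = min(training_data,
                           key=lambda pair: distance(pair[0], features))
    return nearest_label


print(predict((3.1, 3.0)))  # a point close to the "b" examples
print(predict((9.0, 9.0)))  # far from all examples, still one of the known labels
```

Whatever input is supplied, the prediction is always drawn from `known_labels`, which is why a digit could never be recognized by a model trained on letters only.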


The following figure illustrates the concept of a binary classification task given 30 
training samples; 15 training samples are labeled as negative class (minus signs) and 
15 training samples are labeled as positive class (plus signs). In this scenario, our 
dataset is two-dimensional, which means that each sample has two values associated
with it: $x_1$ and $x_2$. Now, we can use a supervised machine learning algorithm to
learn a rule (the decision boundary represented as a dashed line) that can separate
those two classes and classify new data into each of those two categories given its
$x_1$ and $x_2$ values:
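
To make the idea of a decision boundary concrete, here is a minimal sketch of a linear rule in two dimensions. The weights and the bias are hand-picked, hypothetical values; a learning algorithm would estimate them from the labeled training samples instead.

```python
# Hypothetical, hand-picked parameters of a linear decision boundary
# W1 * x1 + W2 * x2 + B = 0. A learning algorithm would estimate these
# values from the labeled training samples.
W1, W2, B = 1.0, 1.0, -5.0


def classify(x1, x2):
    """Return +1 for the positive class and -1 for the negative class."""
    score = W1 * x1 + W2 * x2 + B
    return 1 if score >= 0.0 else -1


# Four new, unlabeled samples given as (x1, x2) pairs.
samples = [(1.0, 2.0), (4.0, 3.5), (2.5, 2.5), (0.5, 1.0)]
labels = [classify(x1, x2) for x1, x2 in samples]
print(labels)
```

Samples on one side of the line receive the positive label and samples on the other side the negative label; a sample exactly on the boundary is assigned to the positive class by the convention chosen in this sketch.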

Regression for predicting continuous outcomes


We learned in the previous section that the task of classification is to assign 
categorical, unordered labels to instances. A second type of supervised learning is
the prediction of continuous outcomes, which is also called regression analysis. In
regression analysis, we are given a number of predictor (explanatory) variables and 
a continuous response variable (outcome or target), and we try to find a relationship 
between those variables that allows us to predict an outcome. 




For example, let's assume that we are interested in predicting the math SAT scores of
our students. If there is a relationship between the time spent studying for the test 
and the final scores, we could use it as training data to learn a model that uses the 
study time to predict the test scores of future students who are planning to take this 
test. 


Note 


The term regression was devised by Francis Galton in his article Regression towards 
Mediocrity in Hereditary Stature in 1886. Galton described the biological
phenomenon that the variance of height in a population does not increase over time.
He observed that the height of parents is not passed on to their children, but instead
the children's height is regressing towards the population mean.


The following figure illustrates the concept of linear regression. Given a predictor 
variable x and a response variable y, we fit a straight line to this data that minimizes 
the distance—most commonly the average squared distance—between the sample 
points and the fitted line. We can now use the intercept and slope learned from this 
data to predict the outcome variable of new data: 
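
The fit described above can be sketched in a few lines of plain Python using the closed-form least-squares solution for a single predictor. The study hours and test scores below are made-up numbers, and `fit_line` is a name chosen for this illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: hours spent studying and the resulting test scores.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

intercept, slope = fit_line(hours, scores)
predicted = intercept + slope * 6.0  # prediction for a new student
print(f"score = {intercept:.1f} + {slope:.1f} * hours")
print(f"predicted score after 6 hours: {predicted:.1f}")
```

The learned intercept and slope are all that is needed to predict the outcome for a new value of the predictor variable.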

Solving interactive problems with reinforcement learning


Another type of machine learning is reinforcement learning. In reinforcement 
learning, the goal is to develop a system (agent) that improves its performance based 
on interactions with the environment. Since the information about the current state of 
the environment typically also includes a so-called reward signal, we can think of 
reinforcement learning as a field related to supervised learning. However, in 
reinforcement learning this feedback is not the correct ground truth label or value,
but a measure of how well the action performed, as measured by a reward function. Through
its interaction with the environment, an agent can then use reinforcement learning to 
learn a series of actions that maximizes this reward via an exploratory trial-and-error 
approach or deliberative planning. 


A popular example of reinforcement learning is a chess engine. Here, the agent
decides upon a series of moves depending on the state of the board (the 
environment), and the reward can be defined as win or lose at the end of the game: 


[Figure: the interaction loop between the agent and the environment, with the labels
Environment, Reward, Action, and Agent.]


There are many different subtypes of reinforcement learning. However, a general 
scheme is that the agent in reinforcement learning tries to maximize the reward by a 
series of interactions with the environment. Each state can be associated with a
positive or negative reward, and a reward can be defined as accomplishing an overall
goal, such as winning or losing a game of chess. For instance, in chess the outcome 
of each move can be thought of as a different state of the environment. To explore 
the chess example further, let's think of visiting certain locations on the chess board 
as being associated with a positive event—for instance, removing an opponent's 
chess piece from the board or threatening the queen. Other positions, however, are 
associated with a negative event, such as losing a chess piece to the opponent in the 
following turn. Now, not every turn results in the removal of a chess piece, and 
reinforcement learning is concerned with learning the series of steps by maximizing 
a reward based on immediate and delayed feedback. 
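
One common way to combine immediate and delayed feedback into a single number is a discounted sum of rewards, in which rewards received later count less than rewards received now. The sketch below is only an illustration of that idea: the discount factor of 0.5 and the reward values for the four moves are assumptions and are not taken from the text.

```python
def discounted_return(rewards, discount=0.5):
    """Sum the rewards, weighting the reward at step t by discount ** t."""
    return sum(discount ** t * r for t, r in enumerate(rewards))


# Hypothetical rewards for four consecutive moves: nothing happens on the
# first and third move, a piece is captured on the second move (+1), and
# the game is won on the fourth move (+8).
rewards = [0.0, 1.0, 0.0, 8.0]
total = discounted_return(rewards)
print(total)
```

An agent that compares series of moves by such a total prefers sequences that lead to a large reward later over sequences that only collect small rewards early on.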


While this section provides a basic overview of reinforcement learning, please note 
that applications of reinforcement learning are beyond the scope of this book, which 
primarily focuses on classification, regression analysis, and clustering.

Discovering hidden structures with unsupervised learning


In supervised learning, we know the right answer beforehand when we train our 
model, and in reinforcement learning, we define a measure of reward for particular 
actions by the agent. In unsupervised learning, however, we are dealing with 
unlabeled data or data of unknown structure. Using unsupervised learning 
techniques, we are able to explore the structure of our data to extract meaningful 
information without the guidance of a known outcome variable or reward function. 


Finding subgroups with clustering 


Clustering is an exploratory data analysis technique that allows us to organize a pile 
of information into meaningful subgroups (clusters) without having any prior 
knowledge of their group memberships. Each cluster that arises during the analysis 
defines a group of objects that share a certain degree of similarity but are more 
dissimilar to objects in other clusters, which is why clustering is also sometimes 
called unsupervised classification. Clustering is a great technique for structuring 
information and deriving meaningful relationships from data. For example, it allows 
marketers to discover customer groups based on their interests, in order to develop 
distinct marketing programs. 
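
As a minimal sketch of how such subgroups can be found without any labels, the following code implements k-means in plain Python. k-means is used here as one illustrative clustering algorithm; the passage above does not prescribe a specific method. The nine two-dimensional points (think of two made-up interest scores per customer) and the choice of initial centroids are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Return, for every point, the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points]


def update(points, assignment, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignment) if a == k]
        if not members:  # keep the centroid of an empty cluster in place
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(coords) / len(members) for coords in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the assignment is stable."""
    assignment = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, assignment, centroids)
        new_assignment = assign(points, centroids)
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids


# Hypothetical unlabeled data: two made-up interest scores per customer.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (5.0, 5.0), (5.5, 4.5), (6.0, 5.5),
          (9.0, 1.0), (9.5, 1.5), (8.5, 0.5)]
initial_centroids = [points[0], points[3], points[6]]

assignment, centroids = k_means(points, initial_centroids)
print(assignment)
```

The result depends on the initial centroids; they are fixed here to keep the sketch deterministic.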


The following figure illustrates how clustering can be applied to organizing
unlabeled data into three distinct groups based on the similarity of their features
$x_1$ and $x_2$:

Dimensionality reduction for data compression


Another subfield of unsupervised learning is dimensionality reduction. Often we 
are working with data of high dimensionality—each observation comes with a high 
number of measurements—that can present a challenge for limited storage space and 
the computational performance of machine learning algorithms. Unsupervised 
dimensionality reduction is a commonly used approach in feature preprocessing to
remove noise from data, which can also degrade the predictive performance of
certain algorithms, and compress the data onto a smaller dimensional subspace while
retaining most of the relevant information. 


Sometimes, dimensionality reduction can also be useful for visualizing data; for
example, a high-dimensional feature set can be projected onto one-, two-, or
three-dimensional feature spaces in order to visualize it via 3D or 2D scatterplots or
histograms. The following figure shows an example where nonlinear dimensionality 
reduction was applied to compress a 3D Swiss Roll onto a new 2D feature subspace: 
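
The compression idea can be sketched on a much smaller scale. The example below uses a linear technique (principal component analysis restricted to two dimensions) rather than the nonlinear method behind the Swiss Roll figure, and it compresses five made-up 2D points onto a single dimension. The function names and the data are assumptions made for this illustration.

```python
import math


def first_principal_direction(points):
    """Return the mean, the unit direction of largest variance, and the
    total variance of a list of 2D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # For a symmetric 2x2 covariance matrix, the angle of the leading
    # eigenvector has this closed form.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    return (mean_x, mean_y), direction, sxx + syy


def project(points, mean, direction):
    """Compress each centered 2D point to one coordinate along direction."""
    return [(x - mean[0]) * direction[0] + (y - mean[1]) * direction[1]
            for x, y in points]


# Made-up observations with two strongly correlated measurements.
points = [(2.0, 1.9), (0.5, 0.7), (3.1, 3.0), (1.2, 1.0), (4.0, 4.2)]

mean, direction, total_variance = first_principal_direction(points)
compressed = project(points, mean, direction)

# The projected values have zero mean, so their variance is the mean square.
retained = sum(z ** 2 for z in compressed) / len(compressed) / total_variance
print(f"retained variance: {retained:.3f}")
```

Because the two measurements are strongly correlated, a single coordinate per observation retains nearly all of the variance in this toy dataset.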

Introduction to the basic terminology and notations


Now that we have discussed the three broad categories of machine learning— 
supervised, unsupervised, and reinforcement learning—let us have a look at the basic 
terminology that we will be using throughout the book. The following table depicts 
an excerpt of the Iris dataset, which is a classic example in the field of machine 
learning. The Iris dataset contains the measurements of 150 Iris flowers from three 
different species—Setosa, Versicolor, and Virginica. Here, each flower sample 
represents one row in our dataset, and the flower measurements in centimeters are 
stored as columns, which we also call the features of the dataset: 


[Figure: excerpt of the Iris dataset shown as a table. Each row is a sample
(instance, observation). The columns are the features (attributes, measurements,
dimensions) sepal length, sepal width, petal length, and petal width, followed by
the class label (target).]


To keep the notation and implementation simple yet efficient, we will make use of
some of the basics of linear algebra. In the following chapters, we will use a matrix
and vector notation to refer to our data. We will follow the common convention to
represent each sample as a separate row in a feature matrix X, where each feature is
stored as a separate column.


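The Iris dataset consisting of 150 samples and four features can then be written as a
$150 \times 4$ matrix $\mathbf{X} \in \mathbb{R}^{150 \times 4}$:


$$
\begin{bmatrix}
x^{(1)}_1 & x^{(1)}_2 & x^{(1)}_3 & x^{(1)}_4 \\
x^{(2)}_1 & x^{(2)}_2 & x^{(2)}_3 & x^{(2)}_4 \\
\vdots & \vdots & \vdots & \vdots \\
x^{(150)}_1 & x^{(150)}_2 & x^{(150)}_3 & x^{(150)}_4
\end{bmatrix}
$$

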
Note 


For the rest of this book, unless noted otherwise, we will use the superscript i to refer
to the ith training sample, and the subscript j to refer to the jth dimension of the
training dataset.


We use lowercase, bold-face letters to refer to vectors
($\mathbf{x} \in \mathbb{R}^{n \times 1}$) and uppercase, bold-face letters to refer to
matrices ($\mathbf{X} \in \mathbb{R}^{n \times m}$). To refer to single elements in a
vector or matrix, we write the letters in italics ($x^{(n)}$ or $x^{(n)}_{m}$, respectively).


For example, $x^{(150)}_1$ refers to the first dimension of flower sample 150, the sepal
length. Thus, each row in this feature matrix represents one flower instance and can
be written as a four-dimensional row vector $\mathbf{x}^{(i)} \in \mathbb{R}^{1 \times 4}$:


$$
\mathbf{x}^{(i)} = \begin{bmatrix} x^{(i)}_1 & x^{(i)}_2 & x^{(i)}_3 & x^{(i)}_4 \end{bmatrix}
$$


And each feature dimension is a 150-dimensional column vector
$\mathbf{x}_j \in \mathbb{R}^{150 \times 1}$. For example:


$$
\mathbf{x}_j = \begin{bmatrix} x^{(1)}_j \\ x^{(2)}_j \\ \vdots \\ x^{(150)}_j \end{bmatrix}
$$


Similarly, we store the target variables (here, class labels) as a 150-dimensional
column vector:


$$
\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(150)} \end{bmatrix}
\quad \left( y \in \{\text{Setosa, Versicolor, Virginica}\} \right)
$$
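
The matrix and vector notation maps directly onto NumPy arrays. The following is a minimal sketch that uses placeholder arrays (zeros and made-up label positions, not the real Iris measurements) only to illustrate the shapes and the zero-based indexing; all variable names are assumptions chosen for illustration:

```python
import numpy as np

# Placeholder feature matrix with the same shape as the Iris data:
# 150 samples (rows) and 4 features (columns). The zeros are NOT real data.
X = np.zeros((150, 4))

# Python uses zero-based indexing, so the first feature of sample 150
# is stored at row 149, column 0.
first_feature_of_sample_150 = X[149, 0]

# One flower sample is a row with four entries.
sample_row = X[149, :]

# One feature dimension is a column with 150 entries, one per sample.
feature_column = X[:, 0]

# Placeholder target vector with one class label per sample.
y = np.array(['Setosa'] * 50 + ['Versicolor'] * 50 + ['Virginica'] * 50)

print(X.shape, sample_row.shape, feature_column.shape, y.shape)
```

Note that the mathematical notation counts samples and features from 1, whereas NumPy counts from 0.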


A roadmap for building machine
learning systems


In previous sections, we discussed the basic concepts of machine learning and the 
three different types of learning. In this section, we will discuss the other important 
parts of a machine learning system accompanying the learning algorithm. The 
following diagram shows a typical workflow for using machine learning in 
predictive modeling, which we will discuss in the following subsections:


[Figure: typical workflow for predictive modeling. Preprocessing (feature extraction
and scaling, feature selection, dimensionality reduction, sampling) turns raw data and
labels into a training dataset and a test dataset. Learning (model selection,
cross-validation, performance metrics, hyperparameter optimization) applies a
learning algorithm to the training dataset to produce the final model, which is
checked against the test dataset and its labels. Prediction applies the final model to
new data.]


Preprocessing - getting data into shape


Let's begin with discussing the roadmap for building machine learning systems. Raw 
data rarely comes in the form and shape that is necessary for the optimal 
performance of a learning algorithm. Thus, the preprocessing of the data is one of the 
most crucial steps in any machine learning application. If we take the Iris flower 
dataset from the previous section as an example, we can think of the raw data as a 
series of flower images from which we want to extract meaningful features. Useful 
features could be the color, the hue, the intensity of the flowers, the height, and the 
flower lengths and widths. Many machine learning algorithms also require that the
selected features are on the same scale for optimal performance, which is often
achieved by transforming the features into the range [0, 1] or to a standard normal
distribution with zero mean and unit variance, as we will see in later chapters.
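
The two scaling approaches can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption of a single made-up feature column; the values and variable names are hypothetical:

```python
import numpy as np

# Hypothetical feature column (for example, lengths in centimeters).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-max scaling: rescale the values to the range [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: shift to zero mean and scale to unit variance.
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std.mean(), x_std.std())
```

Min-max scaling fixes the value range, while standardization centers the feature and expresses it in units of its standard deviation.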


Some of the selected features may be highly correlated and therefore redundant to a 
certain degree. In those cases, dimensionality reduction techniques are useful for 
compressing the features onto a lower dimensional subspace. Reducing the 
dimensionality of our feature space has the advantage that less storage space is 
required, and the learning algorithm can run much faster. In certain cases, 
dimensionality reduction can also improve the predictive performance of a model if 
the dataset contains a large number of irrelevant features (or noise), that is, if the 
dataset has a low signal-to-noise ratio. 


To determine whether our machine learning algorithm not only performs well on the 
training set but also generalizes well to new data, we also want to randomly divide 
the dataset into a separate training and test set. We use the training set to train and 
optimize our machine learning model, while we keep the test set until the very end to 
evaluate the final model. 
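
A random split into training and test sets could look like the following sketch. The toy dataset, the 70/30 ratio, and the seed are assumptions made for illustration only:

```python
import numpy as np

rng = np.random.RandomState(1)  # fixed seed so the split is reproducible

# Hypothetical dataset: 10 samples with 2 features and binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Shuffle the sample indices, then hold out 30% as the test set.
indices = rng.permutation(len(X))
n_test = int(0.3 * len(X))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```

Shuffling before splitting matters when the rows are ordered by class, as they are in this toy example.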


Training and selecting a predictive model


As we will see in later chapters, many different machine learning algorithms have 
been developed to solve different problem tasks. An important point that can be 
summarized from David Wolpert's famous No free lunch theorems is that we can't
get learning "for free" (The Lack of A Priori Distinctions Between Learning
Algorithms, D.H. Wolpert 1996; No free lunch theorems for optimization, D.H.
Wolpert and W.G. Macready, 1997). Intuitively, we can relate this concept to the
popular saying, I suppose it is tempting, if the only tool you have is a hammer, to
treat everything as if it were a nail (Abraham Maslow, 1966). For example, each
classification algorithm has its inherent biases, and no single classification model 
enjoys superiority if we don't make any assumptions about the task. In practice, it is 
therefore essential to compare at least a handful of different algorithms in order to 
train and select the best performing model. But before we can compare different 
models, we first have to decide upon a metric to measure performance. One 
commonly used metric is classification accuracy, which is defined as the proportion
of correctly classified instances. 
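
As a small illustration of this definition, classification accuracy can be computed by comparing predicted and true labels element by element. The label arrays below are made up for the example:

```python
import numpy as np

# Hypothetical true labels and predictions for five samples.
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1])

# Proportion of correctly classified instances (4 of 5 here).
accuracy = np.mean(y_true == y_pred)
print(accuracy)
```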


One legitimate question to ask is this: how do we know which model performs well 
on the final test dataset and real-world data if we don't use this test set for the model 
selection, but keep it for the final model evaluation? In order to address the issue 
embedded in this question, different cross-validation techniques can be used where 
the training dataset is further divided into training and validation subsets in order to 
estimate the generalization performance of the model. Finally, we also cannot expect 
that the default parameters of the different learning algorithms provided by software 
libraries are optimal for our specific problem task. Therefore, we will make frequent 
use of hyperparameter optimization techniques that help us to fine-tune the 
performance of our model in later chapters. Intuitively, we can think of those 
hyperparameters as parameters that are not learned from the data but represent the 
knobs of a model that we can turn to improve its performance. This will become 
much clearer in later chapters when we see actual examples. 
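
One common cross-validation scheme, k-fold cross-validation, can be sketched as follows. The helper name `k_fold_indices` and its parameters are hypothetical and only show how the training dataset might be divided into training and validation subsets; they are not taken from any library:

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=1):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        # Fold i is held out for validation; the rest is used for training.
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each sample is used for validation exactly once, and the k validation scores can then be averaged to estimate the generalization performance.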


Evaluating models and predicting unseen data
instances


After we have selected a model that has been fitted on the training dataset, we can
use the test dataset to estimate how well it performs on unseen data, that is, to
estimate the generalization error. If we are satisfied with its performance, we can
now use this model to predict new, future data. It is important to note that the
parameters for the previously mentioned procedures, such as feature scaling and
dimensionality reduction, are solely obtained from the training dataset, and the same
parameters are later reapplied to transform the test dataset, as well as any new data
samples. Otherwise, the performance measured on the test data may be overly
optimistic.
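
For feature scaling, reusing training-set parameters could look like the following sketch. The arrays are made-up values, and standardization stands in for whichever preprocessing step is being used:

```python
import numpy as np

# Hypothetical training and test features (one column per feature).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

# Estimate the scaling parameters from the training data only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Reuse the same parameters for the training set, the test set, and new data.
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

print(X_train_std)
print(X_test_std)
```

Computing `mu` and `sigma` from the test data as well would let information about the test set leak into the preprocessing step.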


Using Python for machine learning


Python is one of the most popular programming languages for data science and
therefore enjoys a large number of useful add-on libraries developed by its great
developer and open-source community.


Although the performance of interpreted languages, such as Python, for
computation-intensive tasks is inferior to lower-level programming languages,
extension libraries such as NumPy and SciPy have been developed that build upon
lower-layer Fortran and C implementations for fast and vectorized operations on
multidimensional arrays.


For machine learning programming tasks, we will mostly refer to the scikit-learn 
library, which is currently one of the most popular and accessible open source 
machine learning libraries. 


Installing Python and packages from the Python
Package Index


Python is available for all three major operating systems—Microsoft Windows, 
macOS, and Linux—and the installer, as well as the documentation, can be 
downloaded from the official Python website: https://www.python.org. 


This book is written for Python version 3.5.2 or higher, and it is recommended you 
use the most recent version of Python 3 that is currently available, although most of 
the code examples may also be compatible with Python 2.7.13 or higher. If you 
decide to use Python 2.7 to execute the code examples, please make sure that you 
know about the major differences between the two Python versions. A good 
summary of the differences between Python 3.5 and 2.7 can be found at 


https://wiki.python.org/moin/Python2orPython3.


The additional packages that we will be using throughout this book can be installed
via the pip installer program, which has been bundled with the standard Python
installers since Python 3.4. More information about pip can be found at


https://docs.python.org/3/installing/index.html. 


After we have successfully installed Python, we can execute pip from the Terminal 
to install additional Python packages: 


pip install SomePackage 


Already installed packages can be updated via the --upgrade flag: 


pip install SomePackage --upgrade 


Using the Anaconda Python distribution and
package manager


A highly recommended alternative Python distribution for scientific computing is
Anaconda by Continuum Analytics. Anaconda is a free (including for commercial
use), enterprise-ready Python distribution that bundles all the essential Python
packages for data science, math, and engineering in one user-friendly cross-platform
distribution. The Anaconda installer can be downloaded at
http://continuum.io/downloads, and an Anaconda quick-start guide is available at


https://conda.io/docs/test-drive.html.


After successfully installing Anaconda, we can install new Python packages using 
the following command: 


conda install SomePackage 


Existing packages can be updated using the following command: 


conda update SomePackage 


Packages for scientific computing, data science,
and machine learning


Throughout this book, we will mainly use NumPy's multidimensional arrays to store 
and manipulate data. Occasionally, we will make use of pandas, which is a library 
built on top of NumPy that provides additional higher-level data manipulation tools 
that make working with tabular data even more convenient. To augment our learning 
experience and visualize quantitative data, which is often extremely useful to 
intuitively make sense of it, we will use the very customizable Matplotlib library. 


The version numbers of the major Python packages that were used for writing this 
book are mentioned in the following list. Please make sure that the version numbers 
of your installed packages are equal to, or greater than, those version numbers to 
ensure the code examples run correctly: 


NumPy 1.12.1 
SciPy 0.19.0 
scikit-learn 0.18.1 
Matplotlib 2.0.2 
pandas 0.20.1 
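
One way to check the installed versions is sketched below. The helper `installed_version` is a hypothetical convenience function written for this example; it relies on the `__version__` attribute that these packages conventionally expose:

```python
import importlib

def installed_version(package_name):
    """Return the package's version string, or None if it is not installed."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    return getattr(module, '__version__', 'unknown')

# Note: 'sklearn' is the import name of scikit-learn.
for name in ['numpy', 'scipy', 'sklearn', 'matplotlib', 'pandas']:
    print(name, installed_version(name))
```

A result of None indicates that the package still needs to be installed via pip or conda.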


Summary


In this chapter, we explored machine learning at a very high level and familiarized 
ourselves with the big picture and major concepts that we are going to explore in the 
following chapters in more detail. We learned that supervised learning is composed
of two important subfields: classification and regression. While classification models
allow us to categorize objects into known classes, we can use regression analysis to
predict the continuous outcomes of target variables. Unsupervised learning not only
offers useful techniques for discovering structures in unlabeled data, but it can also
be useful for data compression in feature preprocessing steps. We briefly went over
the typical roadmap for applying machine learning to problem tasks, which we will
use as a foundation for deeper discussions and hands-on examples in the following
chapters. Finally, we set up our Python environment and installed and updated
the required packages to get ready to see machine learning in action.


Later in this book, in addition to machine learning itself, we will also introduce 
different techniques to preprocess our dataset, which will help us to get the best 
performance out of different machine learning algorithms. While we will cover 
classification algorithms quite extensively throughout the book, we will also explore 
different techniques for regression analysis and clustering. 


We have an exciting journey ahead, covering many powerful techniques in the vast 
field of machine learning. However, we will approach machine learning one step at a 
time, building upon our knowledge gradually throughout the chapters of this book. 
In the following chapter, we will start this journey by implementing one of the 
earliest machine learning algorithms for classification, which will prepare us for 
Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, where we 
cover more advanced machine learning algorithms using the scikit-learn open source 
machine learning library. 


Chapter 2. Training Simple Machine
Learning Algorithms for Classification


In this chapter, we will make use of two of the first algorithmically described 
machine learning algorithms for classification, the perceptron and adaptive linear 
neurons. We will start by implementing a perceptron step by step in Python and 
training it to classify different flower species in the Iris dataset. This will help us 
understand the concept of machine learning algorithms for classification and how 
they can be efficiently implemented in Python. 


Discussing the basics of optimization using adaptive linear neurons will then lay the
groundwork for using more powerful classifiers via the scikit-learn machine learning
library in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn.


The topics that we will cover in this chapter are as follows: 


• Building an intuition for machine learning algorithms
• Using pandas, NumPy, and Matplotlib to read in, process, and visualize data
• Implementing linear classification algorithms in Python


Artificial neurons - a brief glimpse into
the early history of machine learning


Before we discuss the perceptron and related algorithms in more detail, let us take a
brief tour through the early beginnings of machine learning. Trying to understand
how the biological brain works, in order to design AI, Warren McCulloch and
Walter Pitts published the first concept of a simplified brain cell, the so-called
McCulloch-Pitts (MCP) neuron, in 1943 (A Logical Calculus of the Ideas
Immanent in Nervous Activity, W. S. McCulloch and W. Pitts, Bulletin of
Mathematical Biophysics, 5(4): 115-133, 1943). Neurons are interconnected nerve
cells in the brain that are involved in the processing and transmitting of chemical and
electrical signals, which is illustrated in the following figure:


McCulloch and Pitts described such a nerve cell as a simple logic gate with binary
outputs; multiple signals arrive at the dendrites, are then integrated into the cell
body, and, if the accumulated signal exceeds a certain threshold, an output signal is
generated that will be passed on by the axon.


Only a few years later, Frank Rosenblatt published the first concept of the perceptron
learning rule based on the MCP neuron model (The Perceptron: A Perceiving and
Recognizing Automaton, F. Rosenblatt, Cornell Aeronautical Laboratory, 1957).
With his perceptron rule, Rosenblatt proposed an algorithm that would automatically
learn the optimal weight coefficients that are then multiplied with the input features
in order to make the decision of whether a neuron fires or not. In the context of
supervised learning and classification, such an algorithm could then be used to
predict if a sample belongs to one class or the other.


The formal definition of an artificial neuron


More formally, we can put the idea behind artificial neurons into the context of a
binary classification task where we refer to our two classes as 1 (positive class) and
-1 (negative class) for simplicity. We can then define a decision function $\phi(z)$
that takes a linear combination of certain input values $\mathbf{x}$ and a corresponding
weight vector $\mathbf{w}$, where $z$ is the so-called net input
$z = w_1 x_1 + \dots + w_m x_m$:


$$
\mathbf{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}, \quad
\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}
$$


Now, if the net input of a particular sample $\mathbf{x}^{(i)}$ is greater than a defined
threshold $\theta$, we predict class 1, and class -1 otherwise. In the perceptron
algorithm, the decision function $\phi(\cdot)$ is a variant of a unit step function:


$$
\phi(z) = \begin{cases} 1 & \text{if } z \geq \theta \\ -1 & \text{otherwise} \end{cases}
$$


For simplicity, we can bring the threshold $\theta$ to the left side of the equation and
define a weight-zero as $w_0 = -\theta$ and $x_0 = 1$ so that we write $z$ in a more
compact form:


$$
z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \mathbf{w}^T \mathbf{x}
$$


And:


$$
\phi(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ -1 & \text{otherwise} \end{cases}
$$


In machine learning literature, the negative threshold, or weight, $w_0 = -\theta$, is
usually called the bias unit.
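
To make the compact form concrete, the net input and the unit step function can be sketched in NumPy as follows. The weight values and the two samples are hypothetical, and the function names are chosen for this example only:

```python
import numpy as np

def net_input(x, w):
    """Compute z = w_0 * 1 + w_1 * x_1 + ... + w_m * x_m."""
    return np.dot(x, w[1:]) + w[0]

def unit_step(z):
    """Return 1 if z >= 0 and -1 otherwise."""
    return np.where(z >= 0.0, 1, -1)

# Hypothetical weights: w[0] is the bias unit (the negative threshold).
w = np.array([-1.0, 0.5, 0.5])
x_a = np.array([2.0, 2.0])   # z = -1 + 1 + 1 = 1
x_b = np.array([0.5, 0.5])   # z = -1 + 0.25 + 0.25 = -0.5

print(net_input(x_a, w), unit_step(net_input(x_a, w)))
print(net_input(x_b, w), unit_step(net_input(x_b, w)))
```

Storing the bias unit as the first weight means the threshold no longer needs to be handled separately in the decision function.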


Note 


In the following sections, we will often make use of basic notations from linear
algebra. For example, we will abbreviate the sum of the products of the values in x
and w using a vector dot product, whereas superscript T stands for transpose, which
is an operation that transforms a column vector into a row vector and vice versa:


$$
z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{j=0}^{m} x_j w_j = \mathbf{w}^T \mathbf{x}
$$


For example:


$$
\begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \times \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix}
= 1 \times 4 + 2 \times 5 + 3 \times 6 = 32
$$


Furthermore, the transpose operation can also be applied to a matrix to reflect it over
its diagonal, for example:


$$
\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}^T
= \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}
$$
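
The dot product and the transpose are both available directly in NumPy. The following short sketch uses small made-up arrays to show the two operations:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Vector dot product: 1*4 + 2*5 + 3*6
print(a.dot(b))

# Transposing a 3x2 matrix yields a 2x3 matrix.
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(A.T)
```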


In this book, we will only use very basic concepts from linear algebra; however, if
you need a quick refresher, please take a look at Zico Kolter's excellent Linear
Algebra Review and Reference, which is freely available at


http://www.cs.cmu.edu/~zkolter/course/linalg/linalg_notes.pdf.


The following figure illustrates how the net input $z = \mathbf{w}^T \mathbf{x}$ is squashed
into a binary output (-1 or 1) by the decision function of the perceptron (left subfigure)
and how it can be used to discriminate between two linearly separable classes (right
subfigure):


The perceptron learning rule


The whole idea behind the MCP neuron and Rosenblatt's thresholded perceptron 
model is to use a reductionist approach to mimic how a single neuron in the brain 
works: it either fires or it doesn't. Thus, Rosenblatt's initial perceptron rule is fairly 
simple and can be summarized by the following steps: 


1. Initialize the weights to 0 or small random numbers.
2. For each training sample $\mathbf{x}^{(i)}$:
   a. Compute the output value $\hat{y}$.
   b. Update the weights.


Here, the output value is the class label predicted by the unit step function that we
defined earlier, and the simultaneous update of each weight $w_j$ in the weight vector
$\mathbf{w}$ can be more formally written as:


$$
w_j := w_j + \Delta w_j
$$


The value of $\Delta w_j$, which is used to update the weight $w_j$, is calculated by the
perceptron learning rule:


$$
\Delta w_j = \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}
$$


Where $\eta$ is the learning rate (typically a constant between 0.0 and 1.0), $y^{(i)}$ is the
true class label of the ith training sample, and $\hat{y}^{(i)}$ is the predicted class label. It
is important to note that all weights in the weight vector are being updated
simultaneously, which means that we don't recompute $\hat{y}^{(i)}$ before all of the
weights are updated via their respective $\Delta w_j$. Concretely, for a two-dimensional
dataset, we would write the update as:


$$
\Delta w_0 = \eta \left( y^{(i)} - \text{output}^{(i)} \right)
$$

$$
\Delta w_1 = \eta \left( y^{(i)} - \text{output}^{(i)} \right) x_1^{(i)}
$$

$$
\Delta w_2 = \eta \left( y^{(i)} - \text{output}^{(i)} \right) x_2^{(i)}
$$


Before we implement the perceptron rule in Python, let us make a simple thought
experiment to illustrate how beautifully simple this learning rule really is. In the two
scenarios where the perceptron predicts the class label correctly, the weights remain
unchanged:


$$
\Delta w_j = \eta \left( -1 - (-1) \right) x_j^{(i)} = 0
$$

$$
\Delta w_j = \eta \left( 1 - 1 \right) x_j^{(i)} = 0
$$


However, in the case of a wrong prediction, the weights are being pushed towards
the direction of the positive or negative target class:


$$
\Delta w_j = \eta \left( 1 - (-1) \right) x_j^{(i)} = \eta (2) x_j^{(i)}
$$

$$
\Delta w_j = \eta \left( -1 - 1 \right) x_j^{(i)} = \eta (-2) x_j^{(i)}
$$


To get a better intuition for the multiplicative factor $x_j^{(i)}$, let us go through another
simple example, where:


$$
y^{(i)} = +1, \quad \hat{y}^{(i)} = -1, \quad \eta = 1
$$


Let's assume that $x_j^{(i)} = 0.5$, and we misclassify this sample as -1. In this case, we
would increase the corresponding weight by 1 so that the net input $x_j^{(i)} \times w_j$
would be more positive the next time we encounter this sample, and thus be more likely to
be above the threshold of the unit step function to classify the sample as +1:


$$
\Delta w_j = \left( 1 - (-1) \right) 0.5 = (2) 0.5 = 1
$$


The weight update is proportional to the value of $x_j^{(i)}$. For example, if we have
another sample $x_j^{(i)} = 2$ that is incorrectly classified as -1, we'd push the decision
boundary by an even larger extent to classify this sample correctly the next time:


$$
\Delta w_j = \left( 1 - (-1) \right) 2 = (2) 2 = 4
$$
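
The arithmetic of these examples can be reproduced with a few lines of Python. The function name `weight_update` is a hypothetical helper introduced only for this sketch of the perceptron learning rule:

```python
def weight_update(eta, y_true, y_pred, x_j):
    """Perceptron rule: delta_w_j = eta * (y_true - y_pred) * x_j."""
    return eta * (y_true - y_pred) * x_j

# Correct predictions leave the weight unchanged.
no_change = weight_update(eta=1.0, y_true=1, y_pred=1, x_j=0.5)

# A positive sample misclassified as -1 increases the weight.
small_push = weight_update(eta=1.0, y_true=1, y_pred=-1, x_j=0.5)
large_push = weight_update(eta=1.0, y_true=1, y_pred=-1, x_j=2.0)

print(no_change, small_push, large_push)
```

The larger the feature value of the misclassified sample, the larger the correction applied to the corresponding weight.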


It is important to note that the convergence of the perceptron is only guaranteed if
the two classes are linearly separable and the learning rate is sufficiently small. If the
two classes can't be separated by a linear decision boundary, we can set a maximum
number of passes over the training dataset (epochs) and/or a threshold for the
number of tolerated misclassifications; otherwise, the perceptron would never stop
updating the weights:


[Figure: three example datasets in panels titled "Linearly separable", "Not linearly
separable", and "Not linearly separable".]





Note 


Downloading the example code 


If you bought this book directly from Packt, you can download the example code 
files from your account at http://www.packtpub.com. If you purchased this book 
elsewhere, you can download all code examples and datasets directly from 


https://github.com/rasbt/python-machine-learning-book-2nd-edition. 


Now, before we jump into the implementation in the next section, let us summarize 
what we just learned in a simple diagram that illustrates the general concept of the 
perceptron: 


[Figure: general concept of the perceptron. The inputs and weights feed a net input
function, whose result is passed to a threshold function that produces the output;
the error of the output drives the weight update.]


The preceding diagram illustrates how the perceptron receives the inputs of a sample
x and combines them with the weights w to compute the net input. The net input is
then passed on to the threshold function, which generates a binary output of -1 or +1,
the predicted class label of the sample. During the learning phase, this output is used
to calculate the error of the prediction and update the weights.


Implementing a perceptron learning
algorithm in Python


In the previous section, we learned how Rosenblatt's perceptron rule works; let
us now go ahead and implement it in Python, and apply it to the Iris dataset that we
introduced in Chapter 1, Giving Computers the Ability to Learn from Data.


An object-oriented perceptron API


We will take an object-oriented approach to define the perceptron interface as a
Python class, which allows us to initialize new Perceptron objects that can learn
from data via a fit method, and make predictions via a separate predict method.
As a convention, we append an underscore (_) to attributes that are not being created
upon the initialization of the object but by calling the object's other methods, for
example, self.w_.


Note 


If you are not yet familiar with Python's scientific libraries or need a refresher, please 
see the following resources: 


• NumPy: https://sebastianraschka.com/pdf/books/dlb/appendix_f_numpy-intro.pdf

• pandas: https://pandas.pydata.org/pandas-docs/stable/10min.html

• Matplotlib: http://matplotlib.org/users/beginner.html


The following is the implementation of a perceptron: 


import numpy as np


class Perceptron(object):
    """Perceptron classifier.

    Parameters
    ------------
    eta : float
      Learning rate (between 0.0 and 1.0)
    n_iter : int
      Passes over the training dataset.
    random_state : int
      Random number generator seed for random weight
      initialization.

    Attributes
    -----------
    w_ : 1d-array
      Weights after fitting.
    errors_ : list
      Number of misclassifications (updates) in each epoch.

    """
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
          Training vectors, where n_samples is the number of
          samples and
          n_features is the number of features.
        y : array-like, shape = [n_samples]
          Target values.

        Returns
        -------
        self : object

        """
        # Small random initial weights; w_[0] is the bias unit.
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01,
                              size=1 + X.shape[1])
        self.errors_ = []

        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                # The update is zero when the prediction is correct.
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.net_input(X) >= 0.0, 1, -1)
Using this perceptron implementation, we can now initialize new Perceptron
objects with a given learning rate eta and n_iter, which is the number of epochs
(passes over the training set). Via the fit method, we initialize the weights in
self.w_ to a vector in $\mathbb{R}^{m+1}$, where m stands for the number of dimensions
(features) in the dataset, and where we add 1 for the first element in this vector,
self.w_[0], which represents the so-called bias unit that we discussed earlier.


Also notice that this vector contains small random numbers drawn from a normal
distribution with standard deviation 0.01 via rgen.normal(loc=0.0, scale=0.01,
size=1 + X.shape[1]), where rgen is a NumPy random number generator that we
seeded with a user-specified random seed so that we can reproduce previous results
if desired.


Now, the reason we don't initialize the weights to zero is that the learning rate $\eta$
(eta) only has an effect on the classification outcome if the weights are initialized to
non-zero values. If all the weights are initialized to zero, the learning rate parameter
eta affects only the scale of the weight vector, not the direction. If you are familiar
with trigonometry, consider a vector v1 = [1 2 3], where the angle between v1 and a
vector v2 = 0.5 × v1 would be exactly zero, as demonstrated by the following code
snippet:


>>> v1 = np.array([1, 2, 3])
>>> v2 = 0.5 * v1
>>> np.arccos(v1.dot(v2) / (np.linalg.norm(v1) *
...           np.linalg.norm(v2)))
0.0


Here, np.arccos is the trigonometric inverse cosine and np.linalg.norm is a
function that computes the length of a vector. (The choice to draw the random
numbers from a normal distribution, rather than, for example, a uniform
distribution, and to use a standard deviation of 0.01 was arbitrary;
remember, we are just interested in small random values to avoid the properties of
all-zero vectors as discussed earlier.)
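
The effect of zero initialization can also be checked empirically. The following is a minimal sketch, not the book's Perceptron class: `train_from_zero` is a hypothetical helper that applies the perceptron rule starting from all-zero weights, and the toy data and the two learning rates are assumptions chosen so that the floating-point arithmetic is exact:

```python
import numpy as np

def train_from_zero(X, y, eta, n_iter=10):
    """Run the perceptron rule starting from all-zero weights."""
    w = np.zeros(1 + X.shape[1])  # w[0] is the bias unit
    for _ in range(n_iter):
        for xi, target in zip(X, y):
            prediction = 1 if np.dot(xi, w[1:]) + w[0] >= 0.0 else -1
            update = eta * (target - prediction)
            w[1:] += update * xi
            w[0] += update
    return w

# Hypothetical, linearly separable toy data.
X_toy = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y_toy = np.array([1, 1, -1, -1])

w_small = train_from_zero(X_toy, y_toy, eta=0.5)
w_large = train_from_zero(X_toy, y_toy, eta=2.0)

# Cosine of the angle between the two weight vectors.
cos_angle = w_small.dot(w_large) / (np.linalg.norm(w_small) *
                                    np.linalg.norm(w_large))
print(w_small, w_large, cos_angle)
```

In this sketch, the two learning rates produce weight vectors that differ only by a constant factor, so both runs define the same decision boundary.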


Note 


NumPy indexing for one-dimensional arrays works similarly to Python lists using
the square-bracket ([]) notation. For two-dimensional arrays, the first indexer refers
to the row number and the second indexer to the column number. For example, we
would use X[2, 3] to select the third row and fourth column of a two-dimensional
array X.
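
A short sketch of this indexing convention, using a small made-up array:

```python
import numpy as np

# A 3x4 array holding the integers 0 to 11, filled row by row.
X = np.arange(12).reshape(3, 4)

element = X[2, 3]        # third row, fourth column
first_row = X[0]         # all columns of the first row
second_column = X[:, 1]  # all rows of the second column

print(element, first_row, second_column)
```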


After the weights have been initialized, the fit method loops over all individual
samples in the training set and updates the weights according to the perceptron
learning rule that we discussed in the previous section. The class labels are predicted
by the predict method, which is called in the fit method to predict the class label
for the weight update, but predict can also be used to predict the class labels of new
data after we have fitted our model. Furthermore, we also collect the number of
misclassifications during each epoch in the self.errors_ list so that we can later
analyze how well our perceptron performed during the training. The np.dot function
that is used in the net_input method simply calculates the vector dot product
$\mathbf{w}^T \mathbf{x}$.


Note 


Instead of using NumPy to calculate the vector dot product between two arrays a and
b via a.dot(b) or np.dot(a, b), we could also perform the calculation in pure
Python via sum([i * j for i, j in zip(a, b)]). However, the advantage of
using NumPy over classic Python for loop structures is that its arithmetic operations
are vectorized. Vectorization means that an elemental arithmetic operation is
automatically applied to all elements in an array. By formulating our arithmetic
operations as a sequence of instructions on an array, rather than performing a set of
operations for each element at a time, we can make better use of our modern CPU
architectures with Single Instruction, Multiple Data (SIMD) support. Furthermore,
NumPy uses highly optimized linear algebra libraries such as Basic Linear Algebra
Subprograms (BLAS) and Linear Algebra Package (LAPACK) that have been
written in C or Fortran. Lastly, NumPy also allows us to write our code in a more
compact and intuitive way using the basics of linear algebra, such as vector and
matrix dot products.
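
The equivalence of the pure Python and the vectorized calculations can be checked with a short sketch. The two arrays are made-up values for illustration:

```python
import numpy as np

a = np.arange(5.0)    # [0., 1., 2., 3., 4.]
b = np.full(5, 2.0)   # [2., 2., 2., 2., 2.]

# Pure Python: multiply the paired elements and add them up.
dot_python = sum(i * j for i, j in zip(a, b))

# Vectorized NumPy equivalents.
dot_function = np.dot(a, b)
dot_method = a.dot(b)

print(dot_python, dot_function, dot_method)
```

All three expressions give the same result; the NumPy versions avoid the explicit Python-level loop.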


Training a perceptron model on the Iris dataset


To test our perceptron implementation, we will load the two flower classes Setosa
and Versicolor from the Iris dataset. Although the perceptron rule is not restricted to
two dimensions, we will only consider the two features sepal length and petal length
for visualization purposes. Also, we only chose the two flower classes Setosa and
Versicolor for practical reasons. However, the perceptron algorithm can be extended
to multi-class classification, for example, by using the One-versus-All (OvA) technique.


Note 


OvA, sometimes also called One-versus-Rest (OvR), is a technique that allows
us to extend a binary classifier to multi-class problems. Using OvA, we can train one
classifier per class, where the particular class is treated as the positive class and the
samples from all other classes are considered the negative class. If we were to classify
a new data sample, we would use our n classifiers, where n is the number of class
labels, and assign the class label with the highest confidence to the particular sample.
In the case of the perceptron, we would use OvA to choose the class label that is
associated with the largest net input value.
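

The selection step of OvA can be sketched in a few lines. The function ova_predict and the score matrix below are hypothetical; the sketch assumes that each binary classifier's signed net input serves as its confidence, so the most positive value wins.

```python
import numpy as np

def ova_predict(net_inputs, class_labels):
    """Pick, for every sample, the class whose classifier is most confident.

    net_inputs has shape (n_samples, n_classifiers), one column per binary
    classifier. The values used below are invented, not trained outputs.
    """
    winners = np.argmax(net_inputs, axis=1)
    return np.asarray(class_labels)[winners]

labels = ['setosa', 'versicolor', 'virginica']
scores = np.array([[ 2.1, -0.5, -3.0],
                   [-1.2,  0.3, -0.1],
                   [-4.0, -0.2,  1.7]])

predicted = ova_predict(scores, labels)
print(predicted)
```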


First, we will use the pandas library to load the Iris dataset directly from the UCI
Machine Learning Repository into a DataFrame object and print the last five lines via
the tail method to check that the data was loaded correctly:


>>> import pandas as pd
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/'
...                  'machine-learning-databases/iris/iris.data',
...                  header=None)
>>> df.tail()


Note


You can find a copy of the Iris dataset (and all other datasets used in this book) in the
code bundle of this book, which you can use if you are working offline or the UCI
server at https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data is
temporarily unavailable. For instance, to load the Iris dataset from a local directory,
you can replace this line:


df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                 'machine-learning-databases/iris/iris.data',
                 header=None)


with this:


df = pd.read_csv('your/local/path/to/iris.data',
                 header=None)


Next, we extract the first 100 class labels that correspond to the 50 Iris-setosa and 
50 Iris-versicolor flowers, and convert the class labels into the two integer class 
labels 1 (versicolor) and -1 (setosa) that we assign to a vector y, where the values 
method of a pandas DataFrame yields the corresponding NumPy representation. 


Similarly, we extract the first feature column (sepal length) and the third feature 
column (petal length) of those 100 training samples and assign them to a feature 
matrix X, which we can visualize via a two-dimensional scatter plot:


>>> import matplotlib.pyplot as plt
>>> import numpy as np

>>> # select setosa and versicolor
>>> y = df.iloc[0:100, 4].values
>>> y = np.where(y == 'Iris-setosa', -1, 1)

>>> # extract sepal length and petal length
>>> X = df.iloc[0:100, [0, 2]].values

>>> # plot data
>>> plt.scatter(X[:50, 0], X[:50, 1],
...             color='red', marker='o', label='setosa')
>>> plt.scatter(X[50:100, 0], X[50:100, 1],
...             color='blue', marker='x', label='versicolor')
>>> plt.xlabel('sepal length [cm]')
>>> plt.ylabel('petal length [cm]')
>>> plt.legend(loc='upper left')
>>> plt.show()


After executing the preceding code example, we should now see the following 
scatterplot:


The preceding scatterplot shows the distribution of flower samples in the Iris dataset
along the two feature axes, petal length and sepal length. In this two-dimensional 
feature subspace, we can see that a linear decision boundary should be sufficient to 
separate Setosa from Versicolor flowers. Thus, a linear classifier such as the 
perceptron should be able to classify the flowers in this dataset perfectly. 


Now, it's time to train our perceptron algorithm on the Iris data subset that we just 
extracted. Also, we will plot the misclassification error for each epoch to check 
whether the algorithm converged and found a decision boundary that separates the 
two Iris flower classes:


>>> ppn = Perceptron(eta=0.1, n_iter=10)
>>> ppn.fit(X, y)
>>> plt.plot(range(1, len(ppn.errors_) + 1),
...          ppn.errors_, marker='o')
>>> plt.xlabel('Epochs')
>>> plt.ylabel('Number of updates')
>>> plt.show()


After executing the preceding code, we should see the plot of the misclassification
errors versus the number of epochs, as shown here:


As we can see in the preceding plot, our perceptron converged after the sixth epoch
and should now be able to classify the training samples perfectly. Let us implement a 
small convenience function to visualize the decision boundaries for two-dimensional 
datasets: 


from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=cl,
                    edgecolor='black')


First, we define a number of colors and markers and create a colormap from the list 
of colors via ListedColormap. Then, we determine the minimum and maximum 
values for the two features and use those feature vectors to create a pair of grid 
arrays xx1 and xx2 via the NumPy meshgrid function. Since we trained our 
perceptron classifier on two feature dimensions, we need to flatten the grid arrays 
and create a matrix that has the same number of columns as the Iris training subset so 
that we can use the predict method to predict the class labels Z of the corresponding
grid points.
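

To make the flattening step concrete, here is a minimal sketch on a tiny grid. The ranges are arbitrary illustrative values rather than the Iris feature ranges; the point is only how two grid arrays become a two-column matrix with one row per grid point.

```python
import numpy as np

# Two small feature ranges standing in for x1 and x2 (illustrative values).
xx1, xx2 = np.meshgrid(np.arange(0.0, 1.0, 0.5),   # x1 values: 0.0, 0.5
                       np.arange(0.0, 1.5, 0.5))   # x2 values: 0.0, 0.5, 1.0

# Flatten both grids and stack them as columns: one row per grid point.
grid_points = np.array([xx1.ravel(), xx2.ravel()]).T

print(xx1.shape)          # shape of each grid array
print(grid_points.shape)  # (number of grid points, 2 features)

# Any per-point result can be reshaped back onto the grid.
restored = grid_points[:, 0].reshape(xx1.shape)
```

Because the matrix has two columns, it has the same layout as the two-feature training data, which is what a predict method trained on two features expects.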


After reshaping the predicted class labels Z into a grid with the same dimensions as
xx1 and xx2, we can now draw a contour plot via Matplotlib's contourf function,
which maps the different decision regions to different colors for each predicted class 
in the grid array: 

>>> plot_decision_regions(X, y, classifier=ppn)
>>> plt.xlabel('sepal length [cm]')
>>> plt.ylabel('petal length [cm]')
>>> plt.legend(loc='upper left')
>>> plt.show()


After executing the preceding code example, we should now see a plot of the 
decision regions, as shown in the following figure:


As we can see in the plot, the perceptron learned a decision boundary that is able to
classify all flower samples in the Iris training subset perfectly. 


Note 


Although the perceptron classified the two Iris flower classes perfectly, convergence 
is one of the biggest problems of the perceptron. Frank Rosenblatt proved 
mathematically that the perceptron learning rule converges if the two classes can be 
separated by a linear hyperplane. However, if classes cannot be separated perfectly 
by such a linear decision boundary, the weights will never stop updating unless we 
set a maximum number of epochs. 
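

The convergence behavior described in this note can be observed on two toy problems. The following is a minimal sketch, not the Perceptron class from this chapter: the helper perceptron_errors is a hypothetical, stripped-down version of the learning rule that starts from zero weights, and the AND and XOR labelings are standard toy examples chosen for illustration.

```python
import numpy as np

def perceptron_errors(X, y, eta=1.0, n_iter=100):
    """Run the perceptron rule from zero weights; return updates per epoch."""
    w = np.zeros(1 + X.shape[1])          # w[0] is the bias unit
    errors_per_epoch = []
    for _ in range(n_iter):
        errors = 0
        for xi, target in zip(X, y):
            prediction = 1 if xi.dot(w[1:]) + w[0] >= 0.0 else -1
            update = eta * (target - prediction)
            w[1:] += update * xi
            w[0] += update
            errors += int(update != 0.0)
        errors_per_epoch.append(errors)
    return errors_per_epoch

X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([-1, -1, -1, 1])   # linearly separable
y_xor = np.array([-1, 1, 1, -1])    # not linearly separable

errors_and = perceptron_errors(X_toy, y_and)
errors_xor = perceptron_errors(X_toy, y_xor)
print(errors_and[-1], min(errors_xor))
```

On the separable AND labels the number of updates drops to zero, whereas on XOR every epoch contains at least one update, so training only stops because n_iter is reached.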


Adaptive linear neurons and the
convergence of learning 


In this section, we will take a look at another type of single-layer neural network: 
ADAptive LInear NEuron (Adaline). Adaline was published by Bernard Widrow 
and his doctoral student Ted Hoff, only a few years after Frank Rosenblatt's
perceptron algorithm, and can be considered as an improvement on the latter. (Refer 
to An Adaptive "Adaline" Neuron Using Chemical "Memistors", Technical Report 
Number 1553-2, B. Widrow and others, Stanford Electron Labs, Stanford, CA, 
October 1960). 


The Adaline algorithm is particularly interesting because it illustrates the key 
concepts of defining and minimizing continuous cost functions. This lays the 
groundwork for understanding more advanced machine learning algorithms for 
classification, such as logistic regression, support vector machines, and regression 
models, which we will discuss in future chapters. 


The key difference between the Adaline rule (also known as the Widrow-Hoff rule) 
and Rosenblatt's perceptron is that the weights are updated based on a linear 
activation function rather than a unit step function like in the perceptron. In Adaline,
this linear activation function $\phi(z)$ is simply the identity function of the net input,
so that:


$$\phi(w^T x) = w^T x$$


While the linear activation function is used for learning the weights, we still use a 
threshold function to make the final prediction, which is similar to the unit step
function that we have seen earlier. The main differences between the perceptron and
Adaline algorithm are highlighted in the following figure:


The illustration shows that the Adaline algorithm compares the true class labels with
the linear activation function's continuous valued output to compute the model error 
and update the weights. In contrast, the perceptron compares the true class labels to 
the predicted class labels. 


Minimizing cost functions with gradient descent


One of the key ingredients of supervised machine learning algorithms is a defined 
objective function that is to be optimized during the learning process. This objective 
function is often a cost function that we want to minimize. In the case of Adaline, we
can define the cost function $J$ to learn the weights as the Sum of Squared Errors
(SSE) between the calculated outcome and the true class label:


$$J(w) = \frac{1}{2} \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right)^2$$


The term $\frac{1}{2}$ is just added for our convenience, which will make it easier to derive
the gradient, as we will see in the following paragraphs. The main advantage of this
continuous linear activation function, in contrast to the unit step function, is that the
cost function becomes differentiable. Another nice property of this cost function is
that it is convex; thus, we can use a simple yet powerful optimization algorithm
called gradient descent to find the weights that minimize our cost function to
classify the samples in the Iris dataset.
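

As a quick numerical illustration of the SSE cost, the following sketch evaluates it for a hand-picked weight vector. The helper sse_cost and the toy arrays are assumptions made for this example; w[0] plays the role of the bias unit, as in the implementations in this chapter.

```python
import numpy as np

def sse_cost(X, y, w):
    """J(w) = 1/2 * sum_i (y_i - phi(z_i))^2 with phi the identity function."""
    z = X.dot(w[1:]) + w[0]     # net input; w[0] is the bias unit
    errors = y - z              # linear activation: phi(z) = z
    return 0.5 * np.sum(errors ** 2)

# Toy data and weights, chosen so the arithmetic is easy to follow by hand.
X_toy = np.array([[1.0, 2.0],
                  [2.0, 0.0],
                  [0.0, 1.0]])
y_toy = np.array([1.0, -1.0, 1.0])
w_toy = np.array([0.0, 0.5, 0.5])

cost = sse_cost(X_toy, y_toy, w_toy)
print(cost)
```

Here the net inputs are 1.5, 1.0, and 0.5, the errors are -0.5, -2.0, and 0.5, and half the sum of their squares is 2.25.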


As illustrated in the following figure, we can describe the main idea behind gradient 
descent as climbing down a hill until a local or global cost minimum is reached. In 
each iteration, we take a step in the opposite direction of the gradient where the step 
size is determined by the value of the learning rate, as well as the slope of the
gradient:


[Figure: the cost J(w) plotted against the weight w, marking the initial weight, the gradient, and the global cost minimum J_min(w).]





Using gradient descent, we can now update the weights by taking a step in the
opposite direction of the gradient $\nabla J(w)$ of our cost function $J(w)$:


$$w := w + \Delta w$$


where the weight change $\Delta w$ is defined as the negative gradient multiplied by the
learning rate $\eta$:


$$\Delta w = -\eta \nabla J(w)$$


To compute the gradient of the cost function, we need to compute the partial
derivative of the cost function with respect to each weight $w_j$:


$$\frac{\partial J}{\partial w_j} = -\sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x_j^{(i)}$$


so that we can write the update of weight $w_j$ as:


$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x_j^{(i)}$$


Since we update all weights simultaneously, our Adaline learning rule becomes:


$$w := w + \Delta w$$


Note 


For those who are familiar with calculus, the partial derivative of the SSE cost 
function with respect to the jth weight can be obtained as follows: 


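$$\begin{aligned}
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j} \frac{1}{2} \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right)^2 \\
&= \frac{1}{2} \frac{\partial}{\partial w_j} \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right)^2 \\
&= \frac{1}{2} \sum_i 2 \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) \frac{\partial}{\partial w_j} \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) \\
&= \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) \frac{\partial}{\partial w_j} \left( y^{(i)} - \sum_k w_k x_k^{(i)} \right) \\
&= \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) \left( -x_j^{(i)} \right) \\
&= -\sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x_j^{(i)}
\end{aligned}$$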


Although the Adaline learning rule looks identical to the perceptron rule, we should
note that $\phi(z^{(i)})$ with $z^{(i)} = w^T x^{(i)}$ is a real number and not an integer class label.
Furthermore, the weight update is calculated based on all samples in the training set
(instead of updating the weights incrementally after each sample), which is why this
approach is also referred to as batch gradient descent.
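

The gradient formula and a single batch update can be checked numerically. The following is a minimal sketch under stated assumptions: cost and gradient are hypothetical helper functions, the toy data is made up, and central finite differences serve as an independent check of the analytic expression.

```python
import numpy as np

def cost(X, y, w):
    """SSE cost with a linear (identity) activation; w[0] is the bias unit."""
    errors = y - (X.dot(w[1:]) + w[0])
    return 0.5 * np.sum(errors ** 2)

def gradient(X, y, w):
    """Analytic gradient: dJ/dw_j = -sum_i (y_i - phi(z_i)) * x_ij."""
    errors = y - (X.dot(w[1:]) + w[0])
    grad = np.empty_like(w)
    grad[0] = -errors.sum()           # bias unit, where x_0 = 1
    grad[1:] = -X.T.dot(errors)
    return grad

X_toy = np.array([[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]])
y_toy = np.array([1.0, -1.0, 1.0])
w = np.array([0.0, 0.5, 0.5])

# Central finite differences along each weight axis.
eps = 1e-6
numeric = np.array([
    (cost(X_toy, y_toy, w + eps * e) - cost(X_toy, y_toy, w - eps * e)) / (2 * eps)
    for e in np.eye(len(w))
])
analytic = gradient(X_toy, y_toy, w)

# One batch gradient descent step: w := w + delta_w, delta_w = -eta * grad.
eta = 0.01
w_new = w - eta * analytic
print(analytic, numeric)
print(cost(X_toy, y_toy, w), cost(X_toy, y_toy, w_new))
```

The analytic and numerical gradients agree, and the cost after the step is lower than before, as expected for a sufficiently small learning rate.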


Implementing Adaline in Python


Since the perceptron rule and Adaline are very similar, we will take the perceptron 
implementation that we defined earlier and change the fit method so that the
weights are updated by minimizing the cost function via gradient descent: 


class AdalineGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.
    random_state : int
        Random number generator seed for random weight
        initialization.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Sum-of-squares cost function value in each epoch.

    """
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of
            samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01,
                              size=1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            net_input = self.net_input(X)
            output = self.activation(net_input)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return X

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(self.net_input(X))
                        >= 0.0, 1, -1)


Instead of updating the weights after evaluating each individual training sample, as 
in the perceptron, we calculate the gradient based on the whole training dataset via 
self.eta * errors.sum() for the bias unit (zero-weight) and via self.eta *
X.T.dot(errors) for the weights 1 to m, where X.T.dot(errors) is a matrix-vector
multiplication between our feature matrix and the error vector.


Please note that the activation method has no effect in the code since it is simply 
an identity function. Here, we added the activation function (computed via the 
activation method) to illustrate how information flows through a single layer 
neural network: features from the input data, net input, activation, and output. In the 
next chapter, we will learn about a logistic regression classifier that uses a
non-identity, nonlinear activation function. We will see that a logistic regression model is
closely related to Adaline with the only difference being its activation and cost
function.


Now, similar to the previous perceptron implementation, we collect the cost values
in a self.cost_ list to check whether the algorithm converged after training.


Note 


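Performing a matrix-vector multiplication is similar to calculating a vector dot
product where each row in the matrix is treated as a single row vector. This
vectorized approach represents a more compact notation and results in a more
efficient computation using NumPy. For example:


$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \times \begin{bmatrix} 7 \\ 8 \\ 9 \end{bmatrix} = \begin{bmatrix} 1 \times 7 + 2 \times 8 + 3 \times 9 \\ 4 \times 7 + 5 \times 8 + 6 \times 9 \end{bmatrix} = \begin{bmatrix} 50 \\ 122 \end{bmatrix}$$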
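

The same worked example can be reproduced in NumPy. This short sketch uses the matrix and vector from the note; the variable names are chosen only for illustration.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
v = np.array([7, 8, 9])

# Matrix-vector product: each row of A is dotted with v.
product = A.dot(v)

# The same result computed row by row, for comparison.
row_by_row = np.array([np.dot(row, v) for row in A])

print(product, row_by_row)
```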


In practice, it often requires some experimentation to find a good learning rate $\eta$ for
optimal convergence. So, let's choose two different learning rates, $\eta = 0.01$ and
$\eta = 0.0001$, to start with and plot the cost functions versus the number of epochs to
see how well the Adaline implementation learns from the training data.


Note 


The learning rate $\eta$ (eta), as well as the number of epochs (n_iter), are the
so-called hyperparameters of the perceptron and Adaline learning algorithms. In
Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter 
Tuning, we will take a look at different techniques to automatically find the values of 
different hyperparameters that yield optimal performance of the classification model. 


Let us now plot the cost against the number of epochs for the two different learning 
rates:


>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
>>> ada1 = AdalineGD(n_iter=10, eta=0.01).fit(X, y)
>>> ax[0].plot(range(1, len(ada1.cost_) + 1),
...            np.log10(ada1.cost_), marker='o')
>>> ax[0].set_xlabel('Epochs')
>>> ax[0].set_ylabel('log(Sum-squared-error)')
>>> ax[0].set_title('Adaline - Learning rate 0.01')
>>> ada2 = AdalineGD(n_iter=10, eta=0.0001).fit(X, y)
>>> ax[1].plot(range(1, len(ada2.cost_) + 1),
...            ada2.cost_, marker='o')
>>> ax[1].set_xlabel('Epochs')
>>> ax[1].set_ylabel('Sum-squared-error')
>>> ax[1].set_title('Adaline - Learning rate 0.0001')
>>> plt.show()


As we can see in the resulting cost-function plots, we encountered two different
types of problem. The left chart shows what could happen if we choose a learning
rate that is too large. Instead of minimizing the cost function, the error becomes
larger in every epoch, because we overshoot the global minimum. On the other hand,
we can see that the cost decreases on the right plot, but the chosen learning rate
$\eta = 0.0001$ is so small that the algorithm would require a very large number of
epochs to converge to the global cost minimum:


[Figure: two panels. Left: "Adaline - Learning rate 0.01", log(Sum-squared-error) versus Epochs. Right: "Adaline - Learning rate 0.0001", Sum-squared-error versus Epochs.]





The following figure illustrates what might happen if we change the value of a
particular weight parameter to minimize the cost function $J$. The left subfigure
illustrates the case of a well-chosen learning rate, where the cost decreases gradually,
moving in the direction of the global minimum. The subfigure on the right, however,
illustrates what happens if we choose a learning rate that is too large, so that we
overshoot the global minimum:


[Figure: two panels showing the cost J(w) against the weight w, each marking the initial weight, the gradient, and the global cost minimum J_min(w).]
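

The effect of the learning rate can also be reproduced on a one-dimensional toy problem. The sketch below does not use the Adaline cost; it assumes the simple convex cost J(w) = w**2, whose gradient is 2*w, and the function descend is a hypothetical helper written for this illustration.

```python
def descend(eta, w=1.0, n_steps=10):
    """Gradient descent on the toy cost J(w) = w**2, whose gradient is 2*w."""
    costs = [w ** 2]
    for _ in range(n_steps):
        w = w - eta * 2.0 * w      # step against the gradient
        costs.append(w ** 2)
    return costs

well_chosen = descend(eta=0.1)     # cost shrinks steadily
too_large = descend(eta=1.1)       # each step overshoots the minimum
too_small = descend(eta=0.0001)    # cost barely moves in 10 steps

print(well_chosen[-1], too_large[-1], too_small[-1])
```

With eta=1.1, every update jumps past the minimum to a point farther away on the other side, so the cost grows from step to step; with eta=0.0001, the cost decreases, but only marginally.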


Improving gradient descent through feature
scaling 


Many machine learning algorithms that we will encounter throughout this book 
require some sort of feature scaling for optimal performance, which we will discuss 
in more detail in Chapter 3, A Tour of Machine Learning Classifiers Using scikit- 
learn and Chapter 4, Building Good Training Sets — Data Preprocessing. 


Gradient descent is one of the many algorithms that benefit from feature scaling. In
this section, we will use a feature scaling method called standardization, which
gives our data the zero mean and unit variance of a standard normal distribution and
helps gradient descent learning to converge more quickly. Standardization shifts the
mean of each feature so that it is centered at zero and each feature has a standard
deviation of 1. For instance, to standardize the jth feature, we can simply subtract the
sample mean $\mu_j$ from every training sample and divide it by its standard deviation
$\sigma_j$:


$$x'_j = \frac{x_j - \mu_j}{\sigma_j}$$


Here, $x_j$ is a vector consisting of the jth feature values of all n training samples, and
this standardization technique is applied to each feature j in our dataset.
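

The standardization formula can be applied to all features at once by working column-wise. The following is a small sketch on made-up measurements (the array X_toy is not the Iris subset), using NumPy's mean and std with axis=0.

```python
import numpy as np

# Four illustrative samples with two features each.
X_toy = np.array([[5.1, 1.4],
                  [4.9, 1.5],
                  [7.0, 4.7],
                  [6.4, 4.5]])

# Column-wise: subtract each feature's mean, divide by its standard deviation.
mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)          # population standard deviation (ddof=0)
X_toy_std = (X_toy - mu) / sigma

print(X_toy_std.mean(axis=0))      # approximately 0 for each feature
print(X_toy_std.std(axis=0))       # approximately 1 for each feature
```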


One of the reasons why standardization helps with gradient descent learning is that
the optimizer has to go through fewer steps to find a good or optimal solution (the
global cost minimum), as illustrated in the following figure, where the subfigures
represent the cost surface as a function of two model weights in a two-dimensional
classification problem:


Standardization can easily be achieved using the built-in NumPy methods mean and
std:


>>> X_std = np.copy(X)
>>> X_std[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
>>> X_std[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()


After standardization, we will train Adaline again and see that it now converges after
a small number of epochs using a learning rate $\eta = 0.01$:


>>> ada = AdalineGD(n_iter=15, eta=0.01)
>>> ada.fit(X_std, y)

>>> plot_decision_regions(X_std, y, classifier=ada)
>>> plt.title('Adaline - Gradient Descent')
>>> plt.xlabel('sepal length [standardized]')
>>> plt.ylabel('petal length [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.tight_layout()
>>> plt.show()

>>> plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
>>> plt.xlabel('Epochs')
>>> plt.ylabel('Sum-squared-error')
>>> plt.show()


After executing this code, we should see a figure of the decision regions as well as a
plot of the declining cost, as shown in the following figure:


[Figure: two panels. Left: "Adaline - Gradient Descent" decision regions, petal length [standardized] versus sepal length [standardized]. Right: Sum-squared-error versus Epochs.]


As we can see in the plots, Adaline has now converged after training on the
standardized features using a learning rate $\eta = 0.01$. However, note that the SSE
remains non-zero even though all samples were classified correctly.


Large-scale machine learning and stochastic
gradient descent


In the previous section, we learned how to minimize a cost function by taking a step 
in the opposite direction of a cost gradient that is calculated from the whole training
set; this is why this approach is sometimes also referred to as batch gradient 
descent. Now imagine we have a very large dataset with millions of data points, 
which is not uncommon in many machine learning applications. Running batch 
gradient descent can be computationally quite costly in such scenarios since we need 
to reevaluate the whole training dataset each time we take one step towards the 
global minimum. 


A popular alternative to the batch gradient descent algorithm is stochastic gradient 
descent, sometimes also called iterative or online gradient descent. Instead of 
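updating the weights based on the sum of the accumulated errors over all samples
$x^{(i)}$:


$$\Delta w = \eta \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x^{(i)}$$


we update the weights incrementally for each training sample:


$$\Delta w = \eta \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x^{(i)}$$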


Although stochastic gradient descent can be considered as an approximation of
gradient descent, it typically reaches convergence much faster because of the more
frequent weight updates. Since each gradient is calculated based on a single training
example, the error surface is noisier than in gradient descent, which can also have the
advantage that stochastic gradient descent can escape shallow local minima more
readily if we are working with nonlinear cost functions, as we will see later in
Chapter 12, Implementing a Multilayer Artificial Neural Network from Scratch. To
obtain satisfying results via stochastic gradient descent, it is important to present the
training data in a random order; also, we want to shuffle the training set for every
epoch to prevent cycles.


Note 


In stochastic gradient descent implementations, the fixed learning rate $\eta$ is often
replaced by an adaptive learning rate that decreases over time, for example:


$$\frac{c_1}{[\text{number of iterations}] + c_2}$$


where $c_1$ and $c_2$ are constants. Note that stochastic gradient descent does
not reach the global minimum, but an area very close to it. Using an adaptive
learning rate, we can achieve further annealing to the cost minimum.
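

A decaying schedule of this form takes only a few lines. In the following sketch, the function name decayed_learning_rate and the default constants c1=1.0 and c2=100.0 are arbitrary choices for illustration, not values recommended by the text.

```python
def decayed_learning_rate(iteration, c1=1.0, c2=100.0):
    """Return c1 / (iteration + c2); c1 and c2 are illustrative constants."""
    return c1 / (iteration + c2)

# The learning rate at the start, after 100, and after 900 iterations.
rates = [decayed_learning_rate(t) for t in (0, 100, 900)]
print(rates)
```

The rate starts at c1 / c2 and shrinks toward zero as the iteration count grows, so later updates perturb the weights less and less.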


Another advantage of stochastic gradient descent is that we can use it for online 
learning. In online learning, our model is trained on the fly as new training data 
arrives. This is especially useful if we are accumulating large amounts of data, for
example, customer data in web applications. Using online learning, the system can
immediately adapt to changes and the training data can be discarded after updating 
the model if storage space is an issue. 


Note 


A compromise between batch gradient descent and stochastic gradient descent is
so-called mini-batch learning. Mini-batch learning can be understood as applying
batch gradient descent to smaller subsets of the training data, for example, 32
samples at a time. The advantage over batch gradient descent is that convergence is
reached faster via mini-batches because of the more frequent weight updates.
Furthermore, mini-batch learning allows us to replace the for loop over the training 
samples in stochastic gradient descent with vectorized operations, which can further 
improve the computational efficiency of our learning algorithm. 
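

Mini-batch learning for the Adaline rule could look roughly as follows. This is a sketch under stated assumptions rather than an implementation from this chapter: minibatch_epoch is a hypothetical helper, the data is synthetic, and the learning rate of 0.001 was picked only to keep the toy example stable.

```python
import numpy as np

def minibatch_epoch(X, y, w, eta=0.01, batch_size=32, rgen=None):
    """One pass over the data, updating w in place after every mini-batch."""
    if rgen is None:
        rgen = np.random.RandomState(1)
    order = rgen.permutation(len(y))              # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        errors = y[idx] - (X[idx].dot(w[1:]) + w[0])   # vectorized over the batch
        w[1:] += eta * X[idx].T.dot(errors)
        w[0] += eta * errors.sum()
    return w

def sse(X, y, w):
    return 0.5 * np.sum((y - (X.dot(w[1:]) + w[0])) ** 2)

# Synthetic, linearly separable toy data (not the Iris dataset).
rgen = np.random.RandomState(0)
X_toy = rgen.normal(size=(200, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] >= 0.0, 1.0, -1.0)

w = np.zeros(3)
cost_before = sse(X_toy, y_toy, w)
for _ in range(10):
    w = minibatch_epoch(X_toy, y_toy, w, eta=0.001, batch_size=32, rgen=rgen)
cost_after = sse(X_toy, y_toy, w)
print(cost_before, cost_after)
```

Each update uses one vectorized matrix-vector product over 32 samples instead of a Python loop over individual samples, while the weights are still updated several times per epoch.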


Since we already implemented the Adaline learning rule using gradient descent, we 
only need to make a few adjustments to modify the learning algorithm to update the 
weights via stochastic gradient descent. Inside the fit method, we will now update
the weights after each training sample. Furthermore, we will implement an additional
partial_fit method, which does not reinitialize the weights, for online learning. In
order to check whether our algorithm converged after training, we will calculate the
cost as the average cost of the training samples in each epoch. Furthermore, we will
add an option to shuffle the training data before each epoch to avoid repetitive cycles
when we are optimizing the cost function; via the random_state parameter, we
allow the specification of a random seed for reproducibility:


class AdalineSGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.
    shuffle : bool (default: True)
        Shuffles training data every epoch if True
        to prevent cycles.
    random_state : int
        Random number generator seed for random weight
        initialization.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Sum-of-squares cost function value averaged over all
        training samples in each epoch.

    """
    def __init__(self, eta=0.01, n_iter=10,
                 shuffle=True, random_state=None):
        self.eta = eta
        self.n_iter = n_iter
        self.w_initialized = False
        self.shuffle = shuffle
        self.random_state = random_state

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number
            of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """
        self._initialize_weights(X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            if self.shuffle:
                X, y = self._shuffle(X, y)
            cost = []
            for xi, target in zip(X, y):
                cost.append(self._update_weights(xi, target))
            avg_cost = sum(cost) / len(y)
            self.cost_.append(avg_cost)
        return self

    def partial_fit(self, X, y):
        """Fit training data without reinitializing the weights"""
        if not self.w_initialized:
            self._initialize_weights(X.shape[1])
        if y.ravel().shape[0] > 1:
            for xi, target in zip(X, y):
                self._update_weights(xi, target)
        else:
            self._update_weights(X, y)
        return self

    def _shuffle(self, X, y):
        """Shuffle training data"""
        r = self.rgen.permutation(len(y))
        return X[r], y[r]

    def _initialize_weights(self, m):
        """Initialize weights to small random numbers"""
        self.rgen = np.random.RandomState(self.random_state)
        self.w_ = self.rgen.normal(loc=0.0, scale=0.01,
                                   size=1 + m)
        self.w_initialized = True

    def _update_weights(self, xi, target):
        """Apply Adaline learning rule to update the weights"""
        output = self.activation(self.net_input(xi))
        error = (target - output)
        self.w_[1:] += self.eta * xi.dot(error)
        self.w_[0] += self.eta * error
        cost = 0.5 * error**2
        return cost

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return X

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(self.net_input(X))
                        >= 0.0, 1, -1)


The _shuffle method that we are now using in the AdalineSGD classifier works as
follows: via the permutation function in np.random, we generate a random sequence
of unique numbers in the range 0 to 99 (one index per training sample). Those numbers
can then be used as indices to shuffle our feature matrix and class label vector.
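

The index-based shuffling can be tried out in isolation. The following sketch uses a tiny made-up dataset in which the label of each row equals its row number, so it is easy to see that the features and labels stay aligned after shuffling.

```python
import numpy as np

rgen = np.random.RandomState(1)

# Toy data: the label of row i is i, and the first feature is also i.
X_toy = np.array([[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]])
y_toy = np.array([0, 1, 2, 3])

# A random ordering of the indices 0..3, each appearing exactly once.
r = rgen.permutation(len(y_toy))

# Indexing both arrays with the same r keeps samples and labels paired.
X_shuffled, y_shuffled = X_toy[r], y_toy[r]

print(r)
print(X_shuffled[:, 0], y_shuffled)
```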


We can then use the fit method to train the AdalineSGD classifier and use our
plot_decision_regions to plot our training results:


>>> ada = AdalineSGD(n_iter=15, eta=0.01, random_state=1)
>>> ada.fit(X_std, y)

>>> plot_decision_regions(X_std, y, classifier=ada)
>>> plt.title('Adaline - Stochastic Gradient Descent')
>>> plt.xlabel('sepal length [standardized]')
>>> plt.ylabel('petal length [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
>>> plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
>>> plt.xlabel('Epochs')
>>> plt.ylabel('Average Cost')
>>> plt.show()


The two plots that we obtain from executing the preceding code example are shown
in the following figure:


[Figure: two panels. Left: "Adaline - Stochastic Gradient Descent" decision regions, petal length [standardized] versus sepal length [standardized]. Right: Average Cost versus Epochs.]


As we can see, the average cost goes down pretty quickly, and the final decision
boundary after 15 epochs looks similar to the batch gradient descent Adaline. If we
want to update our model, for example, in an online learning scenario with streaming
data, we could simply call the partial_fit method on individual samples, for
instance ada.partial_fit(X_std[0, :], y[0]).


Summary


In this chapter, we gained a good understanding of the basic concepts of linear 
classifiers for supervised learning. After we implemented a perceptron, we saw how 
we can train adaptive linear neurons efficiently via a vectorized implementation of 
gradient descent and online learning via stochastic gradient descent. 


Now that we have seen how to implement simple classifiers in Python, we are ready 
to move on to the next chapter, where we will use the Python scikit-learn machine 
learning library to get access to more advanced and powerful machine learning 
classifiers that are commonly used in academia as well as in industry. The object- 
oriented approach that we used to implement the perceptron and Adaline algorithms 
will help with understanding the scikit-learn API, which is implemented based on the 
same core concepts that we used in this chapter: the fit and predict methods.
Based on these core concepts, we will learn about logistic regression for modeling 
class probabilities and support vector machines for working with nonlinear decision 
boundaries. In addition, we will introduce a different class of supervised learning 
algorithms, tree-based algorithms, which are commonly combined into robust 
ensemble classifiers. 
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Chapter 3. A Tour of Machine 
Learning Classifiers Using scikit-learn 


In this chapter, we will take a tour through a selection of popular and powerful 
machine learning algorithms that are commonly used in academia as well as in 
industry. While learning about the differences between several supervised learning 
algorithms for classification, we will also develop an intuitive appreciation of their 
individual strengths and weaknesses. In addition, we will take our first step with the 
scikit-learn library, which offers a user-friendly interface for using those algorithms 
efficiently and productively. 


The topics that we will learn about throughout this chapter are as follows: 


• Introduction to robust and popular algorithms for classification, such as logistic
  regression, support vector machines, and decision trees

• Examples and explanations using the scikit-learn machine learning library,
  which provides a wide variety of machine learning algorithms via a
  user-friendly Python API

• Discussions about the strengths and weaknesses of classifiers with linear and
  non-linear decision boundaries
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Choosing a classification algorithm 


Choosing an appropriate classification algorithm for a particular problem task 
requires practice; each algorithm has its own quirks and is based on certain 
assumptions. To restate the No Free Lunch theorem by David H. Wolpert, no single
classifier works best across all possible scenarios (The Lack of A Priori Distinctions
Between Learning Algorithms, David H. Wolpert, Neural Computation 8.7
(1996): 1341-1390). In practice, it is always recommended that you compare the
performance of at least a handful of different learning algorithms to select the best
model for the particular problem; problems may differ in the number of features or
samples, the amount of noise in a dataset, and whether the classes are linearly
separable or not.


Ultimately, the performance of a classifier (computational performance as well as
predictive power) depends heavily on the underlying data that is available for
learning. The five main steps that are involved in training a machine learning
algorithm can be summarized as follows:


1. Selecting features and collecting training samples.
2. Choosing a performance metric.
3. Choosing a classifier and optimization algorithm.
4. Evaluating the performance of the model.
5. Tuning the algorithm.


Since the approach of this book is to build machine learning knowledge step by step,
we will mainly focus on the main concepts of the different algorithms in this chapter 
and revisit topics such as feature selection and preprocessing, performance metrics, 
and hyperparameter tuning for more detailed discussions later in this book. 
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First steps with scikit-learn — training a 
perceptron 


In Chapter 2, Training Simple Machine Learning Algorithms for Classification, you 
learned about two related learning algorithms for classification, the perceptron rule 
and Adaline, which we implemented in Python by ourselves. Now we will take a 
look at the scikit-learn API, which combines a user-friendly interface with a highly 
optimized implementation of several classification algorithms. The scikit-learn 
library offers not only a large variety of learning algorithms, but also many 
convenient functions to preprocess data and to fine-tune and evaluate our models. 
We will discuss this in more detail, together with the underlying concepts, in Chapter 
4, Building Good Training Sets — Data Preprocessing, and Chapter 5, Compressing 
Data via Dimensionality Reduction. 


To get started with the scikit-learn library, we will train a perceptron model similar 
to the one that we implemented in Chapter 2, Training Simple Machine Learning 
Algorithms for Classification. For simplicity, we will use the already familiar Iris 
dataset throughout the following sections. Conveniently, the Iris dataset is already
available via scikit-learn, since it is a simple yet popular dataset that is frequently
used for testing and experimenting with algorithms. We will only use two features 
from the Iris dataset for visualization purposes. 


We will assign the petal length and petal width of the 150 flower samples to the
feature matrix X and the corresponding class labels of the flower species to the vector
y:


>>> from sklearn import datasets
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [2, 3]]
>>> y = iris.target
>>> print('Class labels:', np.unique(y))
Class labels: [0 1 2]


The np.unique(y) function returned the three unique class labels stored in 
iris.target, and as we see, the Iris flower class names Iris-setosa, Iris- 
versicolor, and Iris-virginica are already stored as integers (here: 0, 1, 2). 
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Although many scikit-learn functions and class methods also work with class labels 
in string format, using integer labels is a recommended approach to avoid technical 
glitches and improve computational performance due to a smaller memory footprint; 
furthermore, encoding class labels as integers is a common convention among most 
machine learning libraries. 
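

As a small illustration of the integer-label convention, the following sketch maps string class names to integers with NumPy. The array names is a made-up example (not the Iris data itself), and using np.unique with return_inverse=True is only one possible way to do the encoding, chosen here as an assumption for brevity:

```python
import numpy as np

# Hypothetical string labels, as they might appear in a raw data file
names = np.array(['Iris-setosa', 'Iris-versicolor',
                  'Iris-virginica', 'Iris-setosa'])

# np.unique returns the sorted unique names and, with return_inverse=True,
# the index of each original entry within that sorted array
classes, y_int = np.unique(names, return_inverse=True)

print(classes)
print(y_int)
```

Because the unique names are sorted alphabetically, Iris-setosa, Iris-versicolor, and Iris-virginica map to 0, 1, and 2 in this example, which matches the encoding used in iris.target.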


To evaluate how well a trained model performs on unseen data, we will further split 
the dataset into separate training and test datasets. Later in Chapter 6, Learning Best 
Practices for Model Evaluation and Hyperparameter Tuning, we will discuss the 
best practices around model evaluation in more detail: 


>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=1, stratify=y)


Using the train_test_split function from scikit-learn's model_selection module,
we randomly split the X and y arrays into 30 percent test data (45 samples) and 70
percent training data (105 samples).


Note that the train_test_split function already shuffles the dataset internally
before splitting; otherwise, all class 0 and class 1 samples would have ended up in
the training set, and the test set would consist of 45 samples from class 2. Via the
random_state parameter, we provided a fixed random seed (random_state=1) for
the internal pseudo-random number generator that is used for shuffling the dataset
prior to splitting. Using such a fixed random_state ensures that our results are
reproducible.


Lastly, we took advantage of the built-in support for stratification via stratify=y. In
this context, stratification means that the train_test_split function returns training
and test subsets that have the same proportions of class labels as the input dataset.
We can use NumPy's bincount function, which counts the number of occurrences of
each value in an array, to verify that this is indeed the case:


>>> print('Labels counts in y:', np.bincount(y))
Labels counts in y: [50 50 50]
>>> print('Labels counts in y_train:', np.bincount(y_train))
Labels counts in y_train: [35 35 35]
>>> print('Labels counts in y_test:', np.bincount(y_test))
Labels counts in y_test: [15 15 15]
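

To make the idea of a shuffled, stratified split more concrete, here is a minimal NumPy sketch of the technique. It is not how scikit-learn implements train_test_split internally; the helper stratified_split_indices and the synthetic label vector y_demo are hypothetical names introduced only for this illustration:

```python
import numpy as np

def stratified_split_indices(y, test_size=0.3, seed=1):
    """Return (train_idx, test_idx) that preserve the class proportions of y."""
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # indices of all samples with this label, in shuffled order
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        # send the same fraction of every class to the test set
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Synthetic labels with the same layout as the Iris targets: 50 per class
y_demo = np.repeat([0, 1, 2], 50)
train_idx, test_idx = stratified_split_indices(y_demo, test_size=0.3, seed=1)

train_counts = np.bincount(y_demo[train_idx])
test_counts = np.bincount(y_demo[test_idx])
print('Train counts:', train_counts)
print('Test counts:', test_counts)
```

Splitting each class separately is what guarantees equal proportions; a plain shuffle of all 150 indices would only give approximately balanced subsets.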


Many machine learning and optimization algorithms also require feature scaling for 
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optimal performance, as we remember from the gradient descent example in 
Chapter 2, Training Simple Machine Learning Algorithms for Classification. Here, 
we will standardize the features using the StandardScaler class from scikit-learn's 
preprocessing module: 


>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> sc.fit(X_train)
>>> X_train_std = sc.transform(X_train)
>>> X_test_std = sc.transform(X_test)


Using the preceding code, we loaded the StandardScaler class from the
preprocessing module and initialized a new StandardScaler object that we
assigned to the sc variable. Using the fit method, StandardScaler estimated the
parameters μ (sample mean) and σ (standard deviation) for each feature dimension
from the training data. By calling the transform method, we then standardized the
training data using those estimated parameters μ and σ. Note that we used the
same scaling parameters to standardize the test set so that the values in the
training and test datasets are comparable to each other.
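

The standardization step can also be sketched by hand with NumPy, which makes it explicit that μ and σ come from the training data only. The small arrays X_train_demo and X_test_demo are made-up values for illustration, and the sketch assumes the population standard deviation (NumPy's default, ddof=0):

```python
import numpy as np

# Hypothetical training and test features (two feature columns)
X_train_demo = np.array([[1.0, 10.0],
                         [2.0, 20.0],
                         [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

# Estimate the parameters from the training data only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)   # population standard deviation (ddof=0)

# Apply the same parameters to both datasets
X_train_demo_std = (X_train_demo - mu) / sigma
X_test_demo_std = (X_test_demo - mu) / sigma

print(X_train_demo_std.mean(axis=0), X_train_demo_std.std(axis=0))
print(X_test_demo_std)
```

After the transformation, each training column has zero mean and unit standard deviation, while the test values are simply expressed on the same scale and need not have those properties.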


Having standardized the training data, we can now train a perceptron model. Most 
algorithms in scikit-learn already support multiclass classification by default via the 
One-versus-Rest (OvR) method, which allows us to feed the three flower classes to 
the perceptron all at once. The code is as follows: 


>>> from sklearn.linear_model import Perceptron
>>> ppn = Perceptron(n_iter=40, eta0=0.1, random_state=1)
>>> ppn.fit(X_train_std, y_train)


The scikit-learn interface reminds us of our perceptron implementation in Chapter 2,
Training Simple Machine Learning Algorithms for Classification: after loading the
Perceptron class from the linear_model module, we initialized a new Perceptron
object and trained the model via the fit method. Here, the model parameter eta0 is
equivalent to the learning rate eta that we used in our own perceptron
implementation, and the n_iter parameter (renamed max_iter in later scikit-learn
releases) defines the number of epochs (passes over the training set).


As we remember from Chapter 2, Training Simple Machine Learning Algorithms for 
Classification, finding an appropriate learning rate requires some experimentation. If 
the learning rate is too large, the algorithm will overshoot the global cost minimum. 
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If the learning rate is too small, the algorithm requires more epochs until 
convergence, which can make the learning slow—especially for large datasets. Also, 
we used the random_state parameter to ensure the reproducibility of the
shuffling of the training dataset after each epoch.


Having trained a model in scikit-learn, we can make predictions via the predict 
method, just like in our own perceptron implementation in Chapter 2, Training 
Simple Machine Learning Algorithms for Classification. The code is as follows:


>>> y_pred = ppn.predict(X_test_std)
>>> print('Misclassified samples: %d' % (y_test != y_pred).sum())
Misclassified samples: 3


Executing the code, we see that the perceptron misclassifies three out of the 45
flower samples. Thus, the misclassification error on the test dataset is approximately
0.067 or 6.7 percent (3/45 ≈ 0.067).


Note


Instead of the misclassification error, many machine learning practitioners report the
classification accuracy of a model, which is simply calculated as follows:


1 - error = 0.933 or 93.3 percent.


The scikit-learn library also implements a large variety of different performance 
metrics that are available via the metrics module. For example, we can calculate the 
classification accuracy of the perceptron on the test set as follows: 


>>> from sklearn.metrics import accuracy_score
>>> print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
Accuracy: 0.93


Here, y_test are the true class labels and y_pred are the class labels that we
predicted previously. Alternatively, each classifier in scikit-learn has a score
method, which computes a classifier's prediction accuracy by combining the predict
call with accuracy_score as shown here:


>>> print('Accuracy: %.2f' % ppn.score(X_test_std, y_test))
Accuracy: 0.93
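

The relationship between misclassification error and accuracy can be checked by hand. The following sketch uses made-up label lists (45 samples, three of them deliberately mispredicted) rather than the actual perceptron output; y_true_demo and y_pred_demo are hypothetical names:

```python
# Hypothetical true labels: 15 samples for each of the three classes
y_true_demo = [0] * 15 + [1] * 15 + [2] * 15

# Hypothetical predictions: copy the true labels and corrupt three of them
y_pred_demo = list(y_true_demo)
y_pred_demo[20] = 2   # a class 1 sample predicted as class 2
y_pred_demo[33] = 1   # a class 2 sample predicted as class 1
y_pred_demo[40] = 1   # another class 2 sample predicted as class 1

n_wrong = sum(t != p for t, p in zip(y_true_demo, y_pred_demo))
error = n_wrong / len(y_true_demo)
accuracy = 1.0 - error

print('Misclassified samples: %d' % n_wrong)
print('Error: %.3f, accuracy: %.3f' % (error, accuracy))
```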


Note 
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Note that we evaluate the performance of our models based on the test set in this 
chapter. In Chapter 6, Learning Best Practices for Model Evaluation and
Hyperparameter Tuning, you will
learn about useful techniques, including graphical analysis such as learning curves,
to detect and prevent overfitting. Overfitting means that the model captures the 
patterns in the training data well, but fails to generalize well to unseen data.


Finally, we can use our plot_decision_regions function from Chapter 2, Training
Simple Machine Learning Algorithms for Classification, to plot the decision regions
of our newly trained perceptron model and visualize how well it separates the
different flower samples. However, let's add a small modification to highlight the
samples from the test dataset via small circles:


from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt


def plot_decision_regions(X, y, classifier, test_idx=None,
                          resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot the samples of each class with its own marker and color
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx],
                    marker=markers[idx], label=cl,
                    edgecolor='black')

    # highlight test samples
    if test_idx is not None:
        # select the test samples and draw an empty circle around each
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1],
                    c='none', edgecolor='black', alpha=1.0,
                    linewidth=1, marker='o',
                    s=100, label='test set')


With the slight modification that we made to the plot_decision_regions function,
we can now specify the indices of the samples that we want to mark on the resulting
plots. The code is as follows:


>>> X_combined_std = np.vstack((X_train_std, X_test_std))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X=X_combined_std,
...                       y=y_combined,
...                       classifier=ppn,
...                       test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()


As we can see in the resulting plot, the three flower classes cannot be perfectly 
separated by a linear decision boundary: 
[Figure: decision regions of the perceptron on the standardized Iris data; x-axis: petal length [standardized]; test samples are circled.]





Remember from our discussion in Chapter 2, Training Simple Machine Learning 
Algorithms for Classification, that the perceptron algorithm never converges on 
datasets that aren't perfectly linearly separable, which is why the use of the 
perceptron algorithm is typically not recommended in practice. In the following
sections, we will look at more powerful linear classifiers that converge to a cost 
minimum even if the classes are not perfectly linearly separable. 


Note 


The Perceptron class, as well as other scikit-learn functions and classes, often has
additional parameters that we omit for clarity. You can read more about those
parameters using the help function in Python (for instance, help(Perceptron)) or
by going through the excellent scikit-learn online documentation at
http://scikit-learn.org/stable/.
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Modeling class probabilities via logistic 
regression 


Although the perceptron rule offers a nice and easygoing introduction to machine
learning algorithms for classification, its biggest disadvantage is that it never
converges if the classes are not perfectly linearly separable. The classification task in
the previous section is an example of such a scenario. Intuitively, the reason is that
the weights are continuously being updated, since there is always at least one
misclassified sample present in each epoch. Of course, you can change the learning
rate and increase the number of epochs, but be warned that the perceptron will never
converge on this dataset. To make better use of our time, we will now take a look at
another simple yet more powerful algorithm for linear and binary classification
problems: logistic regression. Note that, in spite of its name, logistic regression is a
model for classification, not regression.
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Logistic regression intuition and conditional 
probabilities 


Logistic regression is a classification model that is very easy to implement and
performs very well on linearly separable classes. It is one of the most widely used
algorithms for classification in industry. Similar to the perceptron and Adaline, the
logistic regression model in this chapter is also a linear model for binary
classification that can be extended to multiclass classification, for example, via the
OvR technique.


To explain the idea behind logistic regression as a probabilistic model, let's first
introduce the odds ratio: the odds in favor of a particular event. The odds ratio can
be written as $\frac{p}{1-p}$, where $p$ stands for the probability of the positive event. The
term positive event does not necessarily mean good, but refers to the event that we
want to predict, for example, the probability that a patient has a certain disease; we
can think of the positive event as class label $y = 1$. We can then further define the
logit function, which is simply the logarithm of the odds ratio (log-odds):


$$\mathrm{logit}(p) = \log \frac{p}{1-p}$$


Note that log refers to the natural logarithm, as it is the common convention in
computer science. The logit function takes as input values in the range 0 to 1 and
transforms them to values over the entire real-number range, which we can use to
express a linear relationship between feature values and the log-odds:


$$\mathrm{logit}(p(y=1 \mid x)) = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m = \sum_{i=0}^{m} w_i x_i = w^T x$$
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Here, p(y=1|x) is the conditional probability that a particular sample belongs to 
class 1 given its features x.
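

As a quick numerical illustration of the odds ratio and the logit function, consider the following sketch. The helper names odds, logit, and inverse_logit are introduced here for illustration only, and the probability 0.8 is an arbitrary example value:

```python
import math

def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1.0 - p)

def logit(p):
    """Natural logarithm of the odds (log-odds)."""
    return math.log(odds(p))

def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
print('odds:  %.3f' % odds(p))      # 0.8 / 0.2, i.e. four to one
print('logit: %.3f' % logit(p))     # log(4)
print('back:  %.3f' % inverse_logit(logit(p)))
```

Probabilities below 0.5 give negative log-odds, 0.5 gives exactly zero, and probabilities above 0.5 give positive log-odds, which is why the logit can be modeled by an unbounded linear function of the features.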





Now, we are actually interested in predicting the probability that a certain sample
belongs to a particular class, which is the inverse form of the logit function. It is
also called the logistic sigmoid function, sometimes simply abbreviated to sigmoid
function due to its characteristic S-shape:


$$\phi(z) = \frac{1}{1 + e^{-z}}$$


Here z is the net input, the linear combination of weights and sample features:


$$z = w^T x = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m$$


Note 


Note that, similar to the convention we used in Chapter 2, Training Simple Machine
Learning Algorithms for Classification, $w_0$ refers to the bias unit and is paired with
an additional input value $x_0$, which is set equal to 1.


Now let us simply plot the sigmoid function for some values in the range -7 to 7 to 
see how it looks: 


>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> def sigmoid(z):
...     return 1.0 / (1.0 + np.exp(-z))
>>> z = np.arange(-7, 7, 0.1)
>>> phi_z = sigmoid(z)
>>> plt.plot(z, phi_z)
>>> plt.axvline(0.0, color='k')
>>> plt.ylim(-0.1, 1.1)
>>> plt.xlabel('z')
>>> plt.ylabel(r'$\phi (z)$')
>>> # y axis ticks and gridline
>>> plt.yticks([0.0, 0.5, 1.0])
>>> ax = plt.gca()
>>> ax.yaxis.grid(True)
>>> plt.show()


As a result of executing the previous code example, we should now see the S-shaped 
(sigmoidal) curve: 





We can see that $\phi(z)$ approaches 1 if z goes towards infinity ($z \to \infty$) since $e^{-z}$
becomes very small for large values of z. Similarly, $\phi(z)$ goes towards 0 for
$z \to -\infty$ as a result of an increasingly large denominator. Thus, we conclude that
this sigmoid function takes real number values as input and transforms them into
values in the range [0, 1] with an intercept at $\phi(z) = 0.5$.


To build some intuition for the logistic regression model, we can relate it to Chapter
2, Training Simple Machine Learning Algorithms for Classification. In Adaline, we
used the identity function $\phi(z) = z$ as the activation function. In logistic regression,
this activation function simply becomes the sigmoid function that we defined earlier.
The difference between Adaline and logistic regression is illustrated in the following
figure:


[Figure: Adaptive Linear Neuron (Adaline): net input function, linear activation function, threshold function, predicted class label; the error is computed from the activation output. Logistic Regression: net input function, sigmoid activation function, threshold function, predicted class label; the output of the sigmoid activation is the conditional probability that a sample belongs to class 1 given its input vector x.]


The output of the sigmoid function is then interpreted as the probability of a
particular sample belonging to class 1, $\phi(z) = P(y=1 \mid x; w)$, given its features x
parameterized by the weights w. For example, if we compute $\phi(z) = 0.8$ for a
particular flower sample, it means that the chance that this sample is an
Iris-versicolor flower is 80 percent. Therefore, the probability that this flower is an
Iris-setosa flower can be calculated as $P(y=0 \mid x; w) = 1 - P(y=1 \mid x; w) = 0.2$ or 20
percent. The predicted probability can then simply be converted into a binary
outcome via a threshold function:




$$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$


If we look at the preceding plot of the sigmoid function, this is equivalent to the
following:


$$\hat{y} = \begin{cases} 1 & \text{if } z \geq 0.0 \\ 0 & \text{otherwise} \end{cases}$$


In fact, there are many applications where we are not only interested in the predicted 
class labels, but where the estimation of the class-membership probability is 
particularly useful (the output of the sigmoid function prior to applying the threshold 
function). Logistic regression is used in weather forecasting, for example, not only to 
predict if it will rain on a particular day but also to report the chance of rain. 
Similarly, logistic regression can be used to predict the chance that a patient has a 
particular disease given certain symptoms, which is why logistic regression enjoys 
great popularity in the field of medicine. 
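

The equivalence of the two threshold rules, and the difference between a probability and a class label, can be sketched with a few arbitrary net input values. The array z_demo is made up for this illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example net inputs, including the boundary case z = 0
z_demo = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])

proba = sigmoid(z_demo)                     # class-membership probabilities
labels_from_proba = np.where(proba >= 0.5, 1, 0)
labels_from_z = np.where(z_demo >= 0.0, 1, 0)

print(np.round(proba, 3))
print(labels_from_proba)
print(labels_from_z)
```

Both rules produce the same labels, but only the probabilities show how confident the model is: net inputs of 0.5 and 3.0 both map to class 1, with quite different values of the sigmoid output.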
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Learning the weights of the logistic cost function 


You learned how we could use the logistic regression model to predict probabilities 
and class labels; now, let us briefly talk about how we fit the parameters of the 
model, for instance the weights w. In the previous chapter, we defined the sum- 
squared-error cost function as follows: 


$$J(w) = \sum_{i} \frac{1}{2} \left( \phi(z^{(i)}) - y^{(i)} \right)^2$$


We minimized this function in order to learn the weights w for our Adaline 
classification model. To explain how we can derive the cost function for logistic 
regression, let's first define the likelihood L that we want to maximize when we build
a logistic regression model, assuming that the individual samples in our dataset are 
independent of one another. The formula is as follows: 


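$$L(w) = P(y \mid x; w) = \prod_{i=1}^{n} P\left(y^{(i)} \mid x^{(i)}; w\right) = \prod_{i=1}^{n} \left(\phi(z^{(i)})\right)^{y^{(i)}} \left(1 - \phi(z^{(i)})\right)^{1 - y^{(i)}}$$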


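In practice, it is easier to maximize the (natural) log of this equation, which is called
the log-likelihood function:


$$l(w) = \log L(w) = \sum_{i=1}^{n} \left[ y^{(i)} \log\left(\phi(z^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - \phi(z^{(i)})\right) \right]$$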


Firstly, applying the log function reduces the potential for numerical underflow, 
which can occur if the likelihoods are very small. Secondly, we can convert the 
product of factors into a summation of terms, which makes it easier to obtain the
derivative of this function via the addition trick, as you may remember from
calculus.
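

The underflow argument is easy to reproduce. In the following sketch, the 200 per-sample probabilities of 0.01 are made-up values chosen only so that their product falls below the smallest representable double-precision number:

```python
import math

# Hypothetical per-sample likelihoods: 200 samples, each with probability 0.01
probs = [0.01] * 200

# Multiplying them directly underflows: the true value 1e-400 is not
# representable as a double, so the running product collapses to 0.0
likelihood = 1.0
for p in probs:
    likelihood *= p

# Summing the logs stays well within floating-point range
log_likelihood = sum(math.log(p) for p in probs)

print('likelihood:    ', likelihood)
print('log-likelihood: %.2f' % log_likelihood)
```

Once the product has collapsed to zero, it no longer distinguishes between better and worse weight vectors, whereas the log-likelihood remains informative.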


Now we could use an optimization algorithm such as gradient ascent to maximize 
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this log-likelihood function. Alternatively, let's rewrite the log-likelihood as a cost 
function J that can be minimized using gradient descent as in Chapter 2, Training 
Simple Machine Learning Algorithms for Classification: 


$$J(w) = \sum_{i=1}^{n} \left[ -y^{(i)} \log\left(\phi(z^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - \phi(z^{(i)})\right) \right]$$


To get a better grasp of this cost function, let us take a look at the cost that we
calculate for a single training sample:


$$J(\phi(z), y; w) = -y \log(\phi(z)) - (1 - y) \log(1 - \phi(z))$$


Looking at the equation, we can see that the first term becomes zero if $y = 0$, and
the second term becomes zero if $y = 1$:


$$J(\phi(z), y; w) = \begin{cases} -\log(\phi(z)) & \text{if } y = 1 \\ -\log(1 - \phi(z)) & \text{if } y = 0 \end{cases}$$


Let's write a short code snippet to create a plot that illustrates the cost of classifying
a single sample for different values of $\phi(z)$:


>>> def cost_1(z):
...     return - np.log(sigmoid(z))
>>> def cost_0(z):
...     return - np.log(1 - sigmoid(z))
>>> z = np.arange(-10, 10, 0.1)
>>> phi_z = sigmoid(z)
>>> c1 = [cost_1(x) for x in z]
>>> plt.plot(phi_z, c1, label='J(w) if y=1')
>>> c0 = [cost_0(x) for x in z]
>>> plt.plot(phi_z, c0, linestyle='--', label='J(w) if y=0')
>>> plt.ylim(0.0, 5.1)
>>> plt.xlim([0, 1])
>>> plt.xlabel(r'$\phi$(z)')
>>> plt.ylabel('J(w)')
>>> plt.legend(loc='best')
>>> plt.show()


The resulting plot shows the sigmoid activation on the x axis, in the range 0 to 1 (the 
inputs to the sigmoid function were z values in the range -10 to 10) and the 
associated logistic cost on the y-axis: 





We can see that the cost approaches 0 (continuous line) if we correctly predict that a
sample belongs to class 1. Similarly, we can see on the y-axis that the cost also
approaches 0 if we correctly predict $y = 0$ (dashed line). However, if the
prediction is wrong, the cost goes towards infinity. The main point is that we
penalize wrong predictions with an increasingly larger cost.
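

A few concrete numbers make the shape of this penalty easier to appreciate. The function logistic_cost below is a hypothetical helper that evaluates the single-sample cost directly from a sigmoid output; the chosen outputs 0.99, 0.5, and 0.01 are arbitrary example values:

```python
import math

def logistic_cost(phi, y):
    """Single-sample logistic cost for sigmoid output phi and true label y."""
    return -y * math.log(phi) - (1 - y) * math.log(1 - phi)

# True label y = 1: confident and right, undecided, confident and wrong
for phi in (0.99, 0.5, 0.01):
    print('phi=%.2f, y=1: cost=%.3f' % (phi, logistic_cost(phi, 1)))

# True label y = 0: the same outputs are penalized in mirrored fashion
for phi in (0.99, 0.5, 0.01):
    print('phi=%.2f, y=0: cost=%.3f' % (phi, logistic_cost(phi, 0)))
```

An undecided output of 0.5 costs log(2) regardless of the true label, while a confident but wrong output costs several times as much.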
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Converting an Adaline implementation Into an 
algorithm for logistic regression 


If we were to implement logistic regression ourselves, we could simply substitute the 
cost function J in our Adaline implementation from Chapter 2, Training Simple 
Machine Learning Algorithms for Classification with the new cost function: 


$$J(w) = -\sum_{i} \left[ y^{(i)} \log\left(\phi(z^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - \phi(z^{(i)})\right) \right]$$


We use this to compute the cost of classifying all training samples per epoch. Also,
we need to swap the linear activation function with the sigmoid activation and
change the threshold function to return class labels 0 and 1 instead of -1 and 1. If we
make those three changes to the Adaline code, we would end up with a working
logistic regression implementation, as shown here:


class LogisticRegressionGD(object):
    """Logistic Regression Classifier using gradient descent.

    Parameters
    ------------
    eta : float
      Learning rate (between 0.0 and 1.0)
    n_iter : int
      Passes over the training dataset.
    random_state : int
      Random number generator seed for random weight
      initialization.

    Attributes
    -----------
    w_ : 1d-array
      Weights after fitting.
    cost_ : list
      Logistic cost function value in each epoch.

    """
    def __init__(self, eta=0.05, n_iter=100, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
          Training vectors, where n_samples is the number of
          samples and n_features is the number of features.
        y : array-like, shape = [n_samples]
          Target values.

        Returns
        -------
        self : object

        """
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01,
                              size=1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            net_input = self.net_input(X)
            output = self.activation(net_input)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()

            # note that we compute the logistic `cost` now
            # instead of the sum of squared errors cost
            cost = (-y.dot(np.log(output)) -
                    ((1 - y).dot(np.log(1 - output))))
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, z):
        """Compute logistic sigmoid activation"""
        return 1. / (1. + np.exp(-np.clip(z, -250, 250)))

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.net_input(X) >= 0.0, 1, 0)
        # equivalent to:
        # return np.where(self.activation(self.net_input(X))
        #                 >= 0.5, 1, 0)

When we fit this logistic regression model, we have to keep in mind that our
implementation only works for binary classification tasks. So, let us consider only
Iris-setosa and Iris-versicolor flowers (classes 0 and 1) and check that our
implementation of logistic regression works:


>>> X_train_01_subset = X_train[(y_train == 0) | (y_train == 1)]
>>> y_train_01_subset = y_train[(y_train == 0) | (y_train == 1)]
>>> lrgd = LogisticRegressionGD(eta=0.05,
...                             n_iter=100,
...                             random_state=1)
>>> lrgd.fit(X_train_01_subset,
...          y_train_01_subset)
>>> plot_decision_regions(X=X_train_01_subset,
...                       y=y_train_01_subset,
...                       classifier=lrgd)
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()


The resulting decision region plot looks as follows: 
[Figure: decision regions of the logistic regression model for classes 0 and 1; x-axis: petal length [standardized].]





Note 


The gradient descent learning algorithm for logistic regression 


Using calculus, we can show that the weight update in logistic regression via 
gradient descent is equal to the equation that we used in Adaline in Chapter 2, 
Training Simple Machine Learning Algorithms for Classification. However, please 
note that the following derivation of the gradient descent learning rule is intended for 
readers who are interested in the mathematical concepts behind the gradient descent 
learning rule for logistic regression. It is not essential for following the rest of this 
chapter. 


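Let's start by calculating the partial derivative of the log-likelihood function with
respect to the jth weight:


$$\frac{\partial}{\partial w_j} l(w) = \left( y \frac{1}{\phi(z)} - (1 - y) \frac{1}{1 - \phi(z)} \right) \frac{\partial}{\partial w_j} \phi(z)$$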
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Before we continue, let's also calculate the partial derivative of the sigmoid function: 


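$$\frac{\partial}{\partial z} \phi(z) = \frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} = \frac{1}{(1 + e^{-z})^2} e^{-z} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = \phi(z)(1 - \phi(z))$$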


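Now, we can re-substitute $\frac{\partial}{\partial z} \phi(z) = \phi(z)(1 - \phi(z))$ in our first equation to
obtain the following:


$$\left( y \frac{1}{\phi(z)} - (1 - y) \frac{1}{1 - \phi(z)} \right) \frac{\partial}{\partial w_j} \phi(z)$$

$$= \left( y \frac{1}{\phi(z)} - (1 - y) \frac{1}{1 - \phi(z)} \right) \phi(z)(1 - \phi(z)) \frac{\partial}{\partial w_j} z$$

$$= \left( y (1 - \phi(z)) - (1 - y) \phi(z) \right) x_j$$

$$= (y - \phi(z)) x_j$$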

Remember that the goal is to find the weights that maximize the log-likelihood, so
we perform the update for each weight as follows:
$$w_j := w_j + \eta \sum_{i=1}^{n} \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}$$


Since we update all weights simultaneously, we can write the general update rule as
follows:


$$w := w + \Delta w$$


We define $\Delta w$ as follows:


$$\Delta w = \eta \nabla l(w)$$


Since maximizing the log-likelihood is equal to minimizing the cost function J that
we defined earlier, we can write the gradient descent update rule as follows:


$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_{i=1}^{n} \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}$$


$$w := w + \Delta w, \quad \Delta w = -\eta \nabla J(w)$$


This is equal to the gradient descent rule for Adaline in Chapter 2, Training Simple 
Machine Learning Algorithms for Classification. 
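

The result of this derivation can be sanity-checked numerically by comparing the analytic gradient of the cost function J with a central finite-difference approximation. This is only a sketch under stated assumptions: the data X_demo, y_demo and the weights w_demo are randomly generated for illustration, and the bias is handled by prepending a column of ones (x_0 = 1) instead of storing it separately:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    """Logistic cost J(w) summed over all samples."""
    phi = sigmoid(X.dot(w))
    return -y.dot(np.log(phi)) - (1 - y).dot(np.log(1 - phi))

def analytic_gradient(w, X, y):
    """Gradient of J: dJ/dw_j = -sum_i (y_i - phi(z_i)) * x_ij."""
    return -X.T.dot(y - sigmoid(X.dot(w)))

# Made-up data: six samples, a bias column of ones plus two features
rng = np.random.RandomState(0)
X_demo = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 2))])
y_demo = np.array([0, 1, 0, 1, 1, 0])
w_demo = rng.normal(scale=0.1, size=3)

# Central finite differences, one weight at a time
eps = 1e-6
numeric_gradient = np.array([
    (cost(w_demo + eps * e, X_demo, y_demo)
     - cost(w_demo - eps * e, X_demo, y_demo)) / (2 * eps)
    for e in np.eye(3)
])

print(np.round(analytic_gradient(w_demo, X_demo, y_demo), 6))
print(np.round(numeric_gradient, 6))
```

If the two vectors agree to several decimal places, the derived expression for the gradient is consistent with the cost function it was derived from.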
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Training a logistic regression model with scikit- 
learn 


We just went through useful coding and math exercises in the previous subsection,
which helped illustrate the conceptual differences between Adaline and logistic
regression. Now, let's learn how to use scikit-learn's more optimized implementation
of logistic regression that also supports multi-class settings off the shelf (OvR by
default). In the following code example, we will use the
sklearn.linear_model.LogisticRegression class as well as the familiar fit
method to train the model on all three classes in the standardized flower training
dataset:


>>> from sklearn.linear_model import LogisticRegression
>>> lr = LogisticRegression(C=100.0, random_state=1)
>>> lr.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std,
...                       y_combined,
...                       classifier=lr,
...                       test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()


After fitting the model on the training data, we plotted the decision regions, training
samples, and test samples, as shown in the following figure:

Looking at the preceding code that we used to train the LogisticRegression model,
you might now be wondering, "What is this mysterious parameter C?" We will
discuss this parameter in the next subsection, where we first introduce the concepts
of overfitting and regularization. However, before we move on to those topics,
let's finish our discussion of class-membership probabilities.


The probability that samples belong to a certain class can be computed
using the predict_proba method. For example, we can predict the probabilities of
the first three samples in the test set as follows:


>>> lr.predict_proba(X_test_std[:3, :])


This code snippet returns the following array:


array([[  3.20136878e-08,   1.46953648e-01,   8.53046320e-01],
       [  8.34428069e-01,   1.65571931e-01,   4.57896429e-12],
       [  8.49182775e-01,   1.50817225e-01,   4.65678779e-13]])

The first row corresponds to the class-membership probabilities of the first flower,
the second row corresponds to the class-membership probabilities of the second flower,
and so forth. Notice that the values in each row sum to one, as expected (you can
confirm this by executing lr.predict_proba(X_test_std[:3, :]).sum(axis=1)).
The highest value in the first row is approximately 0.853, which means that the first
sample belongs to class three (Iris-virginica) with a predicted probability of 85.3
percent. So, as you may have already noticed, we can get the predicted class labels
by identifying the largest column in each row, for example, using NumPy's argmax
function:


>>> lr.predict_proba(X_test_std[:3, :]).argmax(axis=1)


The returned class indices are shown here (they correspond to Iris-virginica,
Iris-setosa, and Iris-setosa):


array([2, 0, 0])


Obtaining the class labels from the preceding conditional probabilities is, of
course, just a manual approach to calling the predict method directly, which we can
quickly verify as follows:


>>> lr.predict(X_test_std[:3, :])
array([2, 0, 0])


Lastly, a word of caution if you want to predict the class label of a single flower
sample: scikit-learn expects a two-dimensional array as data input; thus, we have to
convert a single row slice into such a format first. One way to convert a single row
entry into a two-dimensional data array is to use NumPy's reshape method to add a
new dimension, as demonstrated here:


>>> lr.predict(X_test_std[0, :].reshape(1, -1))
array([2])
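
To make the relationship between class-membership probabilities and class labels
concrete without a fitted model, here is a small NumPy-only sketch. The array proba is
typed in by hand for illustration and merely stands in for a predict_proba result; the
variable names are our own.

```python
import numpy as np

# Hand-typed probabilities standing in for a predict_proba result
# (one row per sample, one column per class)
proba = np.array([[3.20136878e-08, 1.46953648e-01, 8.53046320e-01],
                  [8.34428069e-01, 1.65571931e-01, 4.57896429e-12],
                  [8.49182775e-01, 1.50817225e-01, 4.65678779e-13]])

row_sums = proba.sum(axis=1)      # one sum per sample; each is (close to) 1
labels = proba.argmax(axis=1)     # column index of the most probable class

print(row_sums)
print(labels)

# A single sample is a one-dimensional array; reshape(1, -1) turns it
# into a two-dimensional array with exactly one row
single = proba[0, :]
as_row = single.reshape(1, -1)
print(single.shape, as_row.shape)
```

Using -1 as the second argument of reshape lets NumPy infer the number of columns, so
the same call works regardless of how many features or classes a row contains.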
Tackling overfitting via regularization


Overfitting is a common problem in machine learning, where a model performs well
on training data but does not generalize well to unseen data (test data). If a model
suffers from overfitting, we also say that the model has a high variance, which can
be caused by having too many parameters that lead to a model that is too complex
given the underlying data. Similarly, our model can also suffer from underfitting
(high bias), which means that our model is not complex enough to capture the
pattern in the training data well and therefore also suffers from low performance on
unseen data.


Although we have only encountered linear models for classification so far, the
problem of overfitting and underfitting can be best illustrated by comparing a linear
decision boundary to more complex, nonlinear decision boundaries as shown in the
following figure:





[Figure: three decision boundaries illustrating underfitting (high bias), a good
compromise, and overfitting (high variance)]





Note 


Variance measures the consistency (or variability) of the model prediction for a
particular sample instance if we were to retrain the model multiple times, for
example, on different subsets of the training dataset. We can say that the model is
sensitive to the randomness in the training data. In contrast, bias measures how far
off the predictions are from the correct values in general if we rebuild the model
multiple times on different training datasets; bias is the measure of the systematic
error that is not due to randomness.
One way of finding a good bias-variance tradeoff is to tune the complexity of the
model via regularization. Regularization is a very useful method to handle
collinearity (high correlation among features), filter out noise from data, and
eventually prevent overfitting. The concept behind regularization is to introduce
additional information (bias) to penalize extreme parameter (weight) values. The
most common form of regularization is so-called L2 regularization (sometimes also
called L2 shrinkage or weight decay), which can be written as follows:

$$\frac{\lambda}{2} \lVert \mathbf{w} \rVert^2 = \frac{\lambda}{2} \sum_{j=1}^{m} w_j^2$$

Here, $\lambda$ is the so-called regularization parameter.


Note 


Regularization is another reason why feature scaling such as standardization is
important. For regularization to work properly, we need to ensure that all our
features are on comparable scales.


The cost function for logistic regression can be regularized by adding a simple
regularization term, which will shrink the weights during model training:

$$J(\mathbf{w}) = \sum_{i=1}^{n} \left[ -y^{(i)} \log\left( \phi(z^{(i)}) \right) - \left( 1 - y^{(i)} \right) \log\left( 1 - \phi(z^{(i)}) \right) \right] + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2$$

Via the regularization parameter $\lambda$, we can then control how well we fit the
training data while keeping the weights small. By increasing the value of $\lambda$,
we increase the regularization strength.
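
The regularized cost can be written down in a few lines of NumPy. The following is a
minimal sketch under simplifying assumptions: there is no separate bias term, the
labels are 0 and 1, and the function name regularized_cost as well as the toy values
are ours and only serve as an illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(w, X, y, lambda_):
    """Logistic cost plus the L2 penalty (lambda / 2) * ||w||^2.

    X holds one sample per row and y holds class labels in {0, 1}.
    For simplicity, this sketch has no separate bias term.
    """
    phi = sigmoid(X.dot(w))
    data_term = np.sum(-y * np.log(phi) - (1 - y) * np.log(1 - phi))
    penalty = 0.5 * lambda_ * np.dot(w, w)
    return data_term + penalty

# Toy values chosen only for illustration
X = np.array([[0.5, 1.0], [-1.0, 0.3], [1.5, -0.7], [-0.2, -1.2]])
y = np.array([1, 0, 1, 0])
w = np.array([2.0, 1.0])

for lambda_ in (0.0, 0.1, 1.0, 10.0):
    print(lambda_, regularized_cost(w, X, y, lambda_))
```

For a fixed weight vector, only the penalty term changes with the regularization
parameter, so large weights become increasingly expensive as it grows.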


The parameter C that is implemented for the LogisticRegression class in scikit-learn
comes from a convention in support vector machines, which will be the topic
of the next section. The term C is directly related to the regularization parameter
$\lambda$, which is its inverse. Consequently, decreasing the value of the inverse
regularization parameter C means that we are increasing the regularization strength,
which we can visualize by plotting the L2-regularization path for the two weight
coefficients:


>>> weights, params = [], []
>>> for c in np.arange(-5, 5):
...     lr = LogisticRegression(C=10.**c, random_state=1)
...     lr.fit(X_train_std, y_train)
...     weights.append(lr.coef_[1])
...     params.append(10.**c)
>>> weights = np.array(weights)
>>> plt.plot(params, weights[:, 0],
...          label='petal length')
>>> plt.plot(params, weights[:, 1], linestyle='--',
...          label='petal width')
>>> plt.ylabel('weight coefficient')
>>> plt.xlabel('C')
>>> plt.legend(loc='upper left')
>>> plt.xscale('log')
>>> plt.show()


By executing the preceding code, we fitted ten logistic regression models with
different values for the inverse-regularization parameter C. For the purposes of
illustration, we only collected the weight coefficients of the class 1 versus-all
classifier (here, class 1 is the second class in the dataset, Iris-versicolor);
remember that we are using the OvR technique for multiclass classification.


As the resulting plot shows, the weight coefficients shrink if we decrease
parameter C, that is, if we increase the regularization strength.
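
The same shrinkage effect can be reproduced without scikit-learn. The sketch below
fits an L2-regularized logistic regression by plain gradient descent for several
values of C and records the length of the weight vector. It is a simplified
stand-in, not scikit-learn's solver: the function fit_l2_logistic, the scaling of the
cost (mean loss plus a penalty with lambda = 1/C), the missing bias term, and the
synthetic two-class data are all assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l2_logistic(X, y, C, eta=0.1, n_iter=5000):
    """Gradient descent on the mean logistic cost plus (1 / (2 * C)) * ||w||^2."""
    lambda_ = 1.0 / C
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T.dot(sigmoid(X.dot(w)) - y) / len(y) + lambda_ * w
        w -= eta * grad
    return w

# Two overlapping Gaussian blobs as synthetic training data
rng = np.random.RandomState(1)
X = np.vstack((rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=1.0, size=(50, 2))))
y = np.hstack((np.zeros(50), np.ones(50)))

norms = []
for C in (100.0, 10.0, 1.0, 0.1):
    w = fit_l2_logistic(X, y, C)
    norms.append(np.linalg.norm(w))
    print(C, w, norms[-1])
```

Reading the printed values from top to bottom, C decreases and the norm of the weight
vector should decrease with it, which mirrors the regularization path discussed above.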
Note


Since an in-depth coverage of the individual classification algorithms exceeds the
scope of this book, I strongly recommend Dr. Scott Menard's Logistic Regression: From
Introductory to Advanced Concepts and Applications, Sage Publications, 2009,
to readers who want to learn more about logistic regression.

Maximum margin classification with support vector machines


Another powerful and widely used learning algorithm is the Support Vector
Machine (SVM), which can be considered an extension of the perceptron. Using the
perceptron algorithm, we minimized misclassification errors. However, in SVMs our
optimization objective is to maximize the margin. The margin is defined as the
distance between the separating hyperplane (decision boundary) and the training
samples that are closest to this hyperplane, which are the so-called support vectors.
This is illustrated in the following figure:


[Figure: two panels titled "Which hyperplane?" and "Maximize the margin". The figure
labels the margin, the support vectors, the decision boundary ($\mathbf{w}^T \mathbf{x} = 0$),
the negative hyperplane ($\mathbf{w}^T \mathbf{x} = -1$), and the positive hyperplane
($\mathbf{w}^T \mathbf{x} = 1$).]

Maximum margin intuition


The rationale behind having decision boundaries with large margins is that they tend
to have a lower generalization error, whereas models with small margins are more
prone to overfitting. To get an idea of the margin maximization, let's take a closer
look at those positive and negative hyperplanes that are parallel to the decision
boundary, which can be expressed as follows:

$$w_0 + \mathbf{w}^T \mathbf{x}_{pos} = 1 \quad (1)$$

$$w_0 + \mathbf{w}^T \mathbf{x}_{neg} = -1 \quad (2)$$

If we subtract those two linear equations (1) and (2) from each other, we get:

$$\Rightarrow \mathbf{w}^T \left( \mathbf{x}_{pos} - \mathbf{x}_{neg} \right) = 2$$

We can normalize this equation by the length of the vector w, which is defined as
follows:

$$\lVert \mathbf{w} \rVert = \sqrt{\sum_{j=1}^{m} w_j^2}$$

So we arrive at the following equation:

$$\frac{\mathbf{w}^T \left( \mathbf{x}_{pos} - \mathbf{x}_{neg} \right)}{\lVert \mathbf{w} \rVert} = \frac{2}{\lVert \mathbf{w} \rVert}$$

The left side of the preceding equation can then be interpreted as the distance
between the positive and negative hyperplane, which is the so-called margin that we
want to maximize.


Now, the objective function of the SVM becomes the maximization of this margin
by maximizing $\frac{2}{\lVert \mathbf{w} \rVert}$ under the constraint that the samples
are classified correctly, which can be written as:

$$w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \geq 1 \quad \text{if } y^{(i)} = 1$$

$$w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \leq -1 \quad \text{if } y^{(i)} = -1$$

$$\text{for } i = 1 \ldots N$$

Here, N is the number of samples in our dataset.


These two equations basically say that all negative samples should fall on one side of
the negative hyperplane, whereas all the positive samples should fall behind the
positive hyperplane, which can also be written more compactly as follows:

$$y^{(i)} \left( w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \right) \geq 1 \quad \forall_i$$

In practice though, it is easier to minimize the reciprocal term
$\frac{1}{2} \lVert \mathbf{w} \rVert^2$, which can be
solved by quadratic programming. However, a detailed discussion about quadratic
programming is beyond the scope of this book. You can learn more about support
vector machines in The Nature of Statistical Learning Theory, Springer
Science+Business Media, Vladimir Vapnik, 2000 or Chris J.C. Burges' excellent
explanation in A Tutorial on Support Vector Machines for Pattern Recognition (Data
Mining and Knowledge Discovery, 2(2): 121-167, 1998).
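
To connect these formulas to numbers, the following sketch computes the margin
$2 / \lVert \mathbf{w} \rVert$ for a given weight vector and checks the compact
constraint for a handful of points. The weights and samples are invented for this
illustration; they do not come from a trained model.

```python
import numpy as np

# Hypothetical parameters of a linear SVM (assumed, not fitted)
w0 = -1.0
w = np.array([3.0, 4.0])

# Made-up samples with labels in {-1, 1}
X = np.array([[2.0, -1.0],    # lies on the positive hyperplane
              [1.0, 0.0],
              [0.0, 0.0],     # lies on the negative hyperplane
              [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

margin = 2.0 / np.linalg.norm(w)

# Compact constraint: y_i * (w0 + w^T x_i) >= 1 for every sample
functional = y * (w0 + X.dot(w))
print(margin)
print(functional, np.all(functional >= 1))

# Distance between the two samples on the hyperplanes, measured along w,
# equals the margin
x_pos, x_neg = X[0], X[2]
print(np.dot(w, x_pos - x_neg) / np.linalg.norm(w))
```

Samples for which the constraint holds with equality are exactly the ones that sit on
the positive or negative hyperplane, that is, the support vectors.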
Dealing with a nonlinearly separable case using slack variables


Although we don't want to dive much deeper into the more involved mathematical
concepts behind the maximum-margin classification, let us briefly mention the slack
variable $\xi$, which was introduced by Vladimir Vapnik in 1995 and led to the
so-called soft-margin classification. The motivation for introducing the slack variable
$\xi$ was that the linear constraints need to be relaxed for nonlinearly separable data to
allow the convergence of the optimization in the presence of misclassifications,
under appropriate cost penalization.


The positive-valued slack variable is simply added to the linear constraints:

$$w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \geq 1 - \xi^{(i)} \quad \text{if } y^{(i)} = 1$$

$$w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \leq -1 + \xi^{(i)} \quad \text{if } y^{(i)} = -1$$

$$\text{for } i = 1 \ldots N$$

Here, N is the number of samples in our dataset. So the new objective to be
minimized (subject to the constraints) becomes:

$$\frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \left( \sum_{i} \xi^{(i)} \right)$$
Via the variable C, we can then control the penalty for misclassification. Large values
of C correspond to large error penalties, whereas we are less strict about
misclassification errors if we choose smaller values for C. We can then use the C
parameter to control the width of the margin and therefore tune the bias-variance
trade-off, as illustrated in the following figure:

[Figure: two panels comparing the decision boundary and margin for a large value of
the parameter C and for a small value of the parameter C]

This concept is related to regularization, which we discussed in the previous section
in the context of regularized regression where decreasing the value of C increases the
bias and lowers the variance of the model.
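
The role of C becomes tangible when we evaluate the soft-margin objective by hand.
The sketch below assumes that each slack variable takes the smallest value that
satisfies its constraint, which is max(0, 1 - y (w_0 + w^T x)); the function name
soft_margin_objective and the toy numbers are ours.

```python
import numpy as np

def soft_margin_objective(w0, w, X, y, C):
    """Return (objective, slack) for labels y in {-1, 1}.

    Each slack value is the smallest one that satisfies the relaxed
    constraint y_i * (w0 + w^T x_i) >= 1 - slack_i.
    """
    slack = np.maximum(0.0, 1.0 - y * (w0 + X.dot(w)))
    objective = 0.5 * np.dot(w, w) + C * slack.sum()
    return objective, slack

# Made-up one-dimensional layout embedded in two features
w0, w = 0.0, np.array([1.0, 0.0])
X = np.array([[2.0, 0.0],     # correct and outside the margin: slack 0
              [0.5, 0.0],     # correct but inside the margin: slack 0.5
              [-0.5, 0.0],    # misclassified: slack 1.5
              [-3.0, 0.0]])   # correct and outside the margin: slack 0
y = np.array([1, 1, 1, -1])

for C in (1.0, 100.0):
    objective, slack = soft_margin_objective(w0, w, X, y, C)
    print(C, slack, objective)
```

With the weights held fixed, the same margin violations cost far more for the larger
value of C, which is why an optimizer then prefers a narrower margin with fewer
violations.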


Now that we have learned the basic concepts behind a linear SVM, let us train an
SVM model to classify the different flowers in our Iris dataset:


>>> from sklearn.svm import SVC
>>> svm = SVC(kernel='linear', C=1.0, random_state=1)
>>> svm.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std,
...                       y_combined,
...                       classifier=svm,
...                       test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()

The three decision regions of the SVM, visualized after training the classifier on the
Iris dataset by executing the preceding code example, are shown in the following
plot:

Note


Logistic regression versus support vector machines


In practical classification tasks, linear logistic regression and linear SVMs often
yield very similar results. Logistic regression tries to maximize the conditional
likelihoods of the training data, which makes it more prone to outliers than SVMs,
which mostly care about the points that are closest to the decision boundary (support
vectors). On the other hand, logistic regression has the advantage that it is a simpler
model and can be implemented more easily. Furthermore, logistic regression models
can be easily updated, which is attractive when working with streaming data.
Alternative implementations in scikit-learn


The scikit-learn library's Perceptron and LogisticRegression classes, which we
used in the previous sections, make use of the LIBLINEAR library, which is a highly
optimized C/C++ library developed at the National Taiwan University
(http://www.csie.ntu.edu.tw/~cjlin/liblinear/). Similarly, the SVC class that we used to
train an SVM makes use of LIBSVM, which is an equivalent C/C++ library
specialized for SVMs (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).


The advantage of using LIBLINEAR and LIBSVM over native Python
implementations is that they allow the extremely quick training of large amounts of
linear classifiers. However, sometimes our datasets are too large to fit into computer
memory. Thus, scikit-learn also offers alternative implementations via the
SGDClassifier class, which also supports online learning via the partial_fit
method. The concept behind the SGDClassifier class is similar to the stochastic
gradient algorithm that we implemented in Chapter 2, Training Simple Machine
Learning Algorithms for Classification, for Adaline. We could initialize the
stochastic gradient descent version of the perceptron, logistic regression, and a
support vector machine with default parameters as follows:


>>> from sklearn.linear_model import SGDClassifier
>>> ppn = SGDClassifier(loss='perceptron')
>>> lr = SGDClassifier(loss='log')  # named 'log_loss' in scikit-learn 1.1 and later
>>> svm = SGDClassifier(loss='hinge')
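
To illustrate what online learning with chunks of data means, here is a deliberately
simple NumPy sketch of a linear classifier that is updated batch by batch with
stochastic gradient steps on the hinge loss. The class OnlineHingeSGD and its
partial_fit method are hypothetical stand-ins written for this example; they are not
scikit-learn's implementation and omit features such as regularization and learning
rate schedules.

```python
import numpy as np

class OnlineHingeSGD:
    """Hypothetical minimal linear classifier trained by SGD on the hinge loss."""

    def __init__(self, n_features, eta=0.01):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.eta = eta

    def partial_fit(self, X, y):
        """Update the weights with one pass over a batch; y must be in {-1, 1}."""
        for xi, yi in zip(X, y):
            # Only samples that violate the margin trigger an update
            if yi * (np.dot(self.w, xi) + self.b) < 1:
                self.w += self.eta * yi * xi
                self.b += self.eta * yi
        return self

    def predict(self, X):
        return np.where(X.dot(self.w) + self.b >= 0.0, 1, -1)

# Synthetic, well-separated two-class data in random order
rng = np.random.RandomState(1)
X = np.vstack((rng.normal(loc=-2.0, size=(100, 2)),
               rng.normal(loc=2.0, size=(100, 2))))
y = np.hstack((-np.ones(100), np.ones(100)))
order = rng.permutation(len(y))
X, y = X[order], y[order]

model = OnlineHingeSGD(n_features=2)
for start in range(0, len(y), 20):          # feed the data in chunks of 20
    model.partial_fit(X[start:start + 20], y[start:start + 20])

accuracy = np.mean(model.predict(X) == y)
print(accuracy)
```

Because the model only ever sees one chunk at a time, the full dataset never has to
reside in memory, which is the practical appeal of this style of training.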
Solving nonlinear problems using a kernel SVM


Another reason why SVMs enjoy high popularity among machine learning
practitioners is that they can be easily kernelized to solve nonlinear classification
problems. Before we discuss the main concept behind a kernel SVM, let's first
create a sample dataset to see what such a nonlinear classification problem may look
like.


Kernel methods for linearly inseparable data


Using the following code, we will create a simple dataset that has the form of an
XOR gate using the logical_xor function from NumPy, where 100 samples will be
assigned the class label 1, and 100 samples will be assigned the class label -1:


>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> np.random.seed(1)
>>> X_xor = np.random.randn(200, 2)
>>> y_xor = np.logical_xor(X_xor[:, 0] > 0,
...                        X_xor[:, 1] > 0)
>>> y_xor = np.where(y_xor, 1, -1)
>>> plt.scatter(X_xor[y_xor == 1, 0],
...             X_xor[y_xor == 1, 1],
...             c='b', marker='x',
...             label='1')
>>> plt.scatter(X_xor[y_xor == -1, 0],
...             X_xor[y_xor == -1, 1],
...             c='r',
...             marker='s',
...             label='-1')
>>> plt.xlim([-3, 3])
>>> plt.ylim([-3, 3])
>>> plt.legend(loc='best')
>>> plt.show()


After executing the code, we will have an XOR dataset with random noise, as shown
in the following figure:

Obviously, we would not be able to separate samples from the positive and negative
class very well using a linear hyperplane as a decision boundary via the linear
logistic regression or linear SVM model that we discussed in earlier sections.


The basic idea behind kernel methods to deal with such linearly inseparable data is
to create nonlinear combinations of the original features to project them onto a
higher-dimensional space via a mapping function $\phi$ where it becomes linearly
separable. As shown in the following figure, we can transform a two-dimensional
dataset onto a new three-dimensional feature space where the classes become
separable via the following projection:

$$\phi(x_1, x_2) = (z_1, z_2, z_3) = (x_1, x_2, x_1^2 + x_2^2)$$

This allows us to separate the two classes shown in the plot via a linear hyperplane
that becomes a nonlinear decision boundary if we project it back onto the original
feature space.
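
The effect of this projection is easy to verify on a tiny example. In the sketch
below, the function phi implements the mapping given above, while the toy dataset (an
inner circle of one class surrounded by an outer circle of the other class) is our own
assumption and was chosen because no straight line separates it in two dimensions.

```python
import numpy as np

def phi(X):
    """Map each row (x1, x2) to (x1, x2, x1^2 + x2^2)."""
    return np.column_stack((X[:, 0], X[:, 1], X[:, 0]**2 + X[:, 1]**2))

# Toy data: class 1 on a circle of radius 0.5, class -1 on a circle of radius 2
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
inner = 0.5 * np.column_stack((np.cos(angles), np.sin(angles)))
outer = 2.0 * np.column_stack((np.cos(angles), np.sin(angles)))
X = np.vstack((inner, outer))
y = np.hstack((np.ones(20), -np.ones(20)))

Z = phi(X)

# In the three-dimensional space, the plane z3 = 1 separates the classes
predicted = np.where(Z[:, 2] < 1.0, 1, -1)
print(Z.shape)
print(np.all(predicted == y))
```

Projected back onto the original two features, the plane z3 = 1 corresponds to the
circle x1^2 + x2^2 = 1, which is a nonlinear decision boundary.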
Using the kernel trick to find separating hyperplanes in high-dimensional space


To solve a nonlinear problem using an SVM, we would transform the training data
onto a higher-dimensional feature space via a mapping function $\phi$ and train a linear
SVM model to classify the data in this new feature space. Then, we can use the same
mapping function $\phi$ to transform new, unseen data to classify it using the linear
SVM model.


However, one problem with this mapping approach is that the construction of the
new features is computationally very expensive, especially if we are dealing with
high-dimensional data. This is where the so-called kernel trick comes into play.
Although we didn't go into much detail about how to solve the quadratic
programming task to train an SVM, in practice all we need is to replace the dot
product $\mathbf{x}^{(i)T} \mathbf{x}^{(j)}$ by $\phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$.
In order to save the expensive step of calculating this dot product between two points
explicitly, we define a so-called kernel function:

$$K\left( \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \right) = \phi\left( \mathbf{x}^{(i)} \right)^T \phi\left( \mathbf{x}^{(j)} \right)$$

One of the most widely used kernels is the Radial Basis Function (RBF) kernel, also
simply called the Gaussian kernel:

$$K\left( \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \right) = \exp\left( -\frac{\lVert \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \rVert^2}{2 \sigma^2} \right)$$

This is often simplified to:

$$K\left( \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \right) = \exp\left( -\gamma \lVert \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \rVert^2 \right)$$

Here, $\gamma = \frac{1}{2 \sigma^2}$ is a free parameter that is to be optimized.


Roughly speaking, the term kernel can be interpreted as a similarity function
between a pair of samples. The minus sign inverts the distance measure into a
similarity score, and, due to the exponential term, the resulting similarity score will
fall into a range between 1 (for exactly similar samples) and 0 (for very dissimilar
samples).
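
The similarity interpretation can be checked directly. The following sketch
implements the simplified Gaussian kernel for a single pair of samples; the helper
name gaussian_kernel and the three sample points are assumptions for this example.

```python
import numpy as np

def gaussian_kernel(x_i, x_j, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x_i - x_j||^2)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

a = np.array([1.0, 2.0])
b = np.array([1.5, 2.5])      # close to a
c = np.array([6.0, -3.0])     # far away from a

print(gaussian_kernel(a, a, gamma=0.1))   # identical samples
print(gaussian_kernel(a, b, gamma=0.1))   # nearby samples
print(gaussian_kernel(a, c, gamma=0.1))   # distant samples

# The same pair of nearby samples with a much larger gamma
print(gaussian_kernel(a, b, gamma=100.0))
```

Identical samples get a similarity of exactly 1, and the score decays toward 0 with
growing distance. The last line shows that a larger value of gamma makes this decay
much faster, so that even nearby samples are treated as dissimilar.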


Now that we have covered the big picture behind the kernel trick, let us see if we can
train a kernel SVM that is able to draw a nonlinear decision boundary that separates the
XOR data well. Here, we simply use the SVC class from scikit-learn that we imported
earlier and replace the kernel='linear' parameter with kernel='rbf':


>>> svm = SVC(kernel='rbf', random_state=1, gamma=0.10, C=10.0)
>>> svm.fit(X_xor, y_xor)
>>> plot_decision_regions(X_xor, y_xor, classifier=svm)
>>> plt.legend(loc='upper left')
>>> plt.show()


As we can see in the resulting plot, the kernel SVM separates the XOR data
relatively well:

The $\gamma$ parameter, which we set to gamma=0.1, can be understood as a cut-off
parameter for the Gaussian sphere. If we increase the value for $\gamma$, we reduce the
reach of each training sample's influence, which leads to a tighter and bumpier
decision boundary. To get a better intuition for $\gamma$, let us apply an RBF kernel SVM
to our Iris flower dataset:


>>> svm = SVC(kernel='rbf', random_state=1, gamma=0.2, C=1.0)
>>> svm.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std,
...                       y_combined, classifier=svm,
...                       test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()


Since we chose a relatively small value for $\gamma$, the resulting decision boundary of the
RBF kernel SVM model will be relatively soft, as shown in the following figure:

Now, let us increase the value of $\gamma$ and observe the effect on the decision boundary:


>>> svm = SVC(kernel='rbf', random_state=1, gamma=100.0, C=1.0)
>>> svm.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std,
...                       y_combined, classifier=svm,
...                       test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()


In the resulting plot, we can now see that the decision boundary around the classes 0
and 1 is much tighter using a relatively large value of $\gamma$.

Although the model fits the training dataset very well, such a classifier will likely
have a high generalization error on unseen data. This illustrates that the $\gamma$ parameter
also plays an important role in controlling overfitting.
Decision tree learning


Decision tree classifiers are attractive models if we care about interpretability. As
the name decision tree suggests, we can think of this model as breaking down our
data by making decisions based on asking a series of questions.


Let's consider the following example in which we use a decision tree to decide upon
an activity on a particular day:

[Figure: example decision tree. The root asks "Work to do?" (Yes/No). An internal
node asks "Outlook?" with the branches Sunny, Overcast, and Rainy, followed by a
further Yes/No split. The figure labels an internal node, a branch, and a leaf node.]

Based on the features in our training set, the decision tree model learns a series of
questions to infer the class labels of the samples. Although the preceding figure
illustrates the concept of a decision tree based on categorical variables, the same
concept applies if our features are real numbers, like in the Iris dataset. For example,
we could simply define a cut-off value along the sepal width feature axis and ask a
binary question "Is sepal width > 2.8 cm?"


Using the decision algorithm, we start at the tree root and split the data on the feature
that results in the largest Information Gain (IG), which will be explained in more
detail in the following section. In an iterative process, we can then repeat this
splitting procedure at each child node until the leaves are pure. This means that the
samples at each leaf node all belong to the same class. In practice, this can result in a
very deep tree with many nodes, which can easily lead to overfitting. Thus, we
typically want to prune the tree by setting a limit for the maximal depth of the tree.
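
The core step of this procedure, finding the cut-off value with the largest information
gain for one numeric feature, can be sketched as follows. The sketch uses Gini
impurity as the impurity measure (one of the criteria discussed in the next section);
the function names gini and best_threshold and the small dataset are invented for
illustration and do not reflect how scikit-learn implements its trees.

```python
import numpy as np

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def best_threshold(feature, labels):
    """Return (threshold, gain) of the best binary split 'feature <= threshold'."""
    best = (None, 0.0)
    values = np.unique(feature)                  # sorted distinct values
    for lo, hi in zip(values[:-1], values[1:]):
        threshold = (lo + hi) / 2.0              # midpoint between neighbors
        left = labels[feature <= threshold]
        right = labels[feature > threshold]
        gain = (gini(labels)
                - len(left) / len(labels) * gini(left)
                - len(right) / len(labels) * gini(right))
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Made-up petal widths in cm and class labels, for illustration only
petal_width = np.array([0.2, 0.3, 0.4, 1.3, 1.5, 1.8, 2.1, 2.3])
species = np.array([0, 0, 0, 1, 1, 1, 1, 1])

threshold, gain = best_threshold(petal_width, species)
print(threshold, gain)
```

A full tree learner would run this search over every feature, split on the best one,
and then repeat the procedure on each child node until the leaves are pure or a depth
limit is reached.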
Maximizing information gain - getting the most bang for your buck


In order to split the nodes at the most informative features, we need to define an
objective function that we want to optimize via the tree learning algorithm. Here, our
objective function is to maximize the information gain at each split, which we define
as follows:

$$IG(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j)$$

Here, f is the feature to perform the split, $D_p$ and $D_j$ are the dataset of the parent
and jth child node, I is our impurity measure, $N_p$ is the total number of samples at
the parent node, and $N_j$ is the number of samples in the jth child node. As we can
see, the information gain is simply the difference between the impurity of the parent
node and the weighted sum of the child node impurities: the lower the impurity of the
child nodes, the larger the information gain. However, for simplicity and to reduce the
combinatorial search space, most libraries (including scikit-learn) implement binary
decision trees. This means that each parent node is split into two child nodes,
$D_{left}$ and $D_{right}$:

$$IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right})$$

Now, the three impurity measures or splitting criteria that are commonly used in
binary decision trees are Gini impurity ($I_G$), entropy ($I_H$), and the classification
error ($I_E$). Let us start with the definition of entropy for all non-empty classes
($p(i|t) \neq 0$):

$$I_H(t) = -\sum_{i=1}^{c} p(i|t) \log_2 p(i|t)$$

Here, $p(i|t)$ is the proportion of the samples that belong to class i for a particular
node t. The entropy is therefore 0 if all samples at a node belong to the same class,
and the entropy is maximal if we have a uniform class distribution. For example, in a
binary class setting, the entropy is 0 if $p(i=1|t) = 1$ or $p(i=0|t) = 0$. If the
classes are distributed uniformly with $p(i=1|t) = 0.5$ and $p(i=0|t) = 0.5$, the
entropy is 1. Therefore, we can say that the entropy criterion attempts to maximize
the mutual information in the tree.


Intuitively, the Gini impurity can be understood as a criterion to minimize the
probability of misclassification:

$$I_G(t) = \sum_{i=1}^{c} p(i|t) \left( 1 - p(i|t) \right) = 1 - \sum_{i=1}^{c} p(i|t)^2$$

Similar to entropy, the Gini impurity is maximal if the classes are perfectly mixed,
for example, in a binary class setting ($c = 2$):

$$I_G(t) = 1 - \sum_{i=1}^{c} 0.5^2 = 0.5$$

However, in practice both Gini impurity and entropy typically yield very similar
results, and it is often not worth spending much time on evaluating trees using
different impurity criteria rather than experimenting with different pruning cut-offs.


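Another impurity measure is the classification error:

$$I_E(t) = 1 - \max\{ p(i|t) \}$$

This is a useful criterion for pruning but not recommended for growing a decision
tree, since it is less sensitive to changes in the class probabilities of the nodes. We
can illustrate this by looking at the two possible splitting scenarios shown in the
following figure:

[Figure: two candidate splits of a parent node with class counts (40, 40). Scenario A:
$D_{left}$ = (30, 10), $D_{right}$ = (10, 30). Scenario B: $D_{left}$ = (20, 40),
$D_{right}$ = (20, 0).]

We start with a dataset $D_p$ at the parent node, which consists of 40 samples from
class 1 and 40 samples from class 2 that we split into two datasets, $D_{left}$ and
$D_{right}$. The information gain using the classification error as a splitting criterion
would be the same ($IG_E = 0.25$) in both scenarios, A and B:

$$I_E(D_p) = 1 - 0.5 = 0.5$$

$$A: I_E(D_{left}) = 1 - \frac{3}{4} = 0.25$$

$$A: I_E(D_{right}) = 1 - \frac{3}{4} = 0.25$$

$$A: IG_E = 0.5 - \frac{4}{8} 0.25 - \frac{4}{8} 0.25 = 0.25$$

$$B: I_E(D_{left}) = 1 - \frac{4}{6} = \frac{1}{3}$$

$$B: I_E(D_{right}) = 1 - 1 = 0$$

$$B: IG_E = 0.5 - \frac{6}{8} \times \frac{1}{3} - 0 = 0.25$$

However, the Gini impurity would favor the split in scenario B ($IG_G = 0.1\overline{6}$)
over scenario A ($IG_G = 0.125$), which is indeed more pure:

$$I_G(D_p) = 1 - (0.5^2 + 0.5^2) = 0.5$$

$$A: I_G(D_{left}) = 1 - \left( \left( \frac{3}{4} \right)^2 + \left( \frac{1}{4} \right)^2 \right) = \frac{3}{8} = 0.375$$

$$A: I_G(D_{right}) = 1 - \left( \left( \frac{1}{4} \right)^2 + \left( \frac{3}{4} \right)^2 \right) = \frac{3}{8} = 0.375$$

$$A: IG_G = 0.5 - \frac{4}{8} 0.375 - \frac{4}{8} 0.375 = 0.125$$

$$B: I_G(D_{left}) = 1 - \left( \left( \frac{2}{6} \right)^2 + \left( \frac{4}{6} \right)^2 \right) = \frac{4}{9} = 0.\overline{4}$$

$$B: I_G(D_{right}) = 1 - (1^2 + 0^2) = 0$$

$$B: IG_G = 0.5 - \frac{6}{8} 0.\overline{4} - 0 = 0.1\overline{6}$$

Similarly, the entropy criterion would also favor scenario B ($IG_H = 0.31$) over
scenario A ($IG_H = 0.19$):

$$I_H(D_p) = -\left( 0.5 \log_2(0.5) + 0.5 \log_2(0.5) \right) = 1$$

$$A: I_H(D_{left}) = -\left( \frac{3}{4} \log_2\left( \frac{3}{4} \right) + \frac{1}{4} \log_2\left( \frac{1}{4} \right) \right) = 0.81$$

$$A: I_H(D_{right}) = -\left( \frac{1}{4} \log_2\left( \frac{1}{4} \right) + \frac{3}{4} \log_2\left( \frac{3}{4} \right) \right) = 0.81$$

$$A: IG_H = 1 - \frac{4}{8} 0.81 - \frac{4}{8} 0.81 = 0.19$$

$$B: I_H(D_{left}) = -\left( \frac{2}{6} \log_2\left( \frac{2}{6} \right) + \frac{4}{6} \log_2\left( \frac{4}{6} \right) \right) = 0.92$$

$$B: I_H(D_{right}) = 0$$

$$B: IG_H = 1 - \frac{6}{8} 0.92 - 0 = 0.31$$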
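
These hand calculations can be reproduced with a short script. The sketch below
assumes the class counts that are consistent with the fractions used in the
calculations: a parent node with (40, 40) samples, scenario A splitting it into
(30, 10) and (10, 30), and scenario B splitting it into (20, 40) and (20, 0). The
function names are our own.

```python
import numpy as np

def gini(p):
    return 1.0 - np.sum(p**2)

def entropy(p):
    p = p[p > 0]                      # skip empty classes
    return -np.sum(p * np.log2(p))

def error(p):
    return 1.0 - np.max(p)

def information_gain(parent, children, impurity):
    """parent and each child hold per-class sample counts."""
    parent = np.asarray(parent, dtype=float)
    gain = impurity(parent / parent.sum())
    for child in children:
        child = np.asarray(child, dtype=float)
        weight = child.sum() / parent.sum()
        gain -= weight * impurity(child / child.sum())
    return gain

parent = (40, 40)
scenarios = {'A': [(30, 10), (10, 30)],
             'B': [(20, 40), (20, 0)]}

for name, children in scenarios.items():
    for impurity in (error, gini, entropy):
        ig = information_gain(parent, children, impurity)
        print(name, impurity.__name__, round(ig, 3))
```

The classification error assigns the same gain to both scenarios, whereas Gini
impurity and entropy both assign a larger gain to scenario B.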


For a more visual comparison of the three different impurity criteria that we
discussed previously, let us plot the impurity indices for the probability range [0, 1]
for class 1. Note that we will also add a scaled version of the entropy (entropy / 2) to
observe that the Gini impurity is an intermediate measure between entropy and the
classification error. The code is as follows:


>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> def gini(p):
...     return (p)*(1 - (p)) + (1 - p)*(1 - (1-p))
>>> def entropy(p):
...     return - p*np.log2(p) - (1 - p)*np.log2((1 - p))
>>> def error(p):
...     return 1 - np.max([p, 1 - p])
>>> x = np.arange(0.0, 1.0, 0.01)
>>> ent = [entropy(p) if p != 0 else None for p in x]
>>> sc_ent = [e*0.5 if e else None for e in ent]
>>> err = [error(i) for i in x]
>>> fig = plt.figure()
>>> ax = plt.subplot(111)
>>> for i, lab, ls, c, in zip([ent, sc_ent, gini(x), err],
...                           ['Entropy', 'Entropy (scaled)',
...                            'Gini Impurity',
...                            'Misclassification Error'],
...                           ['-', '-', '--', '-.'],
...                           ['black', 'lightgray',
...                            'red', 'green', 'cyan']):
...     line = ax.plot(x, i, label=lab,
...                    linestyle=ls, lw=2, color=c)
>>> ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),
...           ncol=5, fancybox=True, shadow=False)
>>> ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
>>> ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
>>> plt.ylim([0, 1.1])
>>> plt.xlabel('p(i=1)')
>>> plt.ylabel('Impurity Index')
>>> plt.show()
The plot produced by the preceding code example is as follows:

[Figure: impurity index plotted against p(i=1) for entropy, scaled entropy, Gini
impurity, and misclassification error]

Building a decision tree 


Decision trees can build complex decision boundaries by dividing the feature space
into rectangles. However, we have to be careful since the deeper the decision tree,
the more complex the decision boundary becomes, which can easily result in
overfitting. Using scikit-learn, we will now train a decision tree with a maximum
depth of 4, using Gini impurity as a criterion for impurity. Although feature scaling
may be desired for visualization purposes, note that feature scaling is not a
requirement for decision tree algorithms. The code is as follows:


>>> from sklearn.tree import DecisionTreeClassifier
>>> tree = DecisionTreeClassifier(criterion='gini',
...                               max_depth=4,
...                               random_state=1)
>>> tree.fit(X_train, y_train)
>>> X_combined = np.vstack((X_train, X_test))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X_combined,
...                       y_combined,
...                       classifier=tree,
...                       test_idx=range(105, 150))
>>> plt.xlabel('petal length [cm]')
>>> plt.ylabel('petal width [cm]')
>>> plt.legend(loc='upper left')
>>> plt.show()


After executing the code example, we get the typical axis-parallel decision 
boundaries of the decision tree: 




[Figure: decision regions of the decision tree with the test set samples
highlighted; x-axis: petal length [cm], y-axis: petal width [cm].]





A nice feature in scikit-learn is that it allows us to export the decision tree as a .dot 
file after training, which we can visualize using the GraphViz program, for example. 


This program is freely available from http://www.graphviz.org and supported by
Linux, Windows, and macOS. In addition to GraphViz, we will use a Python library
called pydotplus, which has capabilities similar to GraphViz and allows us to
convert .dot data files into a decision tree image file. After you have installed
GraphViz (by following the instructions on http://www.graphviz.org/Download.php),
you can install pydotplus directly via the pip installer, for example, by executing
the following command in your Terminal:


> pip3 install pydotplus 


Note 


Note that on some systems, you may have to install the pydotplus prerequisites
manually by executing the following commands:

pip3 install graphviz
pip3 install pyparsing


The following code will create an image of our decision tree in PNG format in our 
local directory: 


>>> from pydotplus import graph_from_dot_data
>>> from sklearn.tree import export_graphviz
>>> dot_data = export_graphviz(tree,
...                            filled=True,
...                            rounded=True,
...                            class_names=['Setosa',
...                                         'Versicolor',
...                                         'Virginica'],
...                            feature_names=['petal length',
...                                           'petal width'],
...                            out_file=None)
>>> graph = graph_from_dot_data(dot_data)
>>> graph.write_png('tree.png')


By using the out_file=None setting, we directly assigned the dot data to a dot_data
variable, instead of writing an intermediate tree.dot file to disk. The arguments for
filled, rounded, class_names, and feature_names are optional but make the
resulting image file visually more appealing by adding color, rounding the box
edges, showing the name of the majority class label at each node, and displaying the
feature names in the splitting criterion. These settings resulted in the following
decision tree image:


[Figure: decision tree image (tree.png). Nodes legible in the figure:
root: petal width <= 0.75, gini = 0.6667, samples = 105, value = [35, 35, 35], class = Setosa;
False branch of the root: petal length <= 4.75, gini = 0.5, samples = 70, value = [0, 35, 35], class = Versicolor;
a deeper node: petal length <= 4.95, gini = 0.5, samples = 8, value = [0, 4, 4], class = Versicolor.]








Looking at the decision tree figure, we can now nicely trace back the splits that the
decision tree determined from our training dataset. We started with 105 samples at
the root and split them into two child nodes with 35 and 70 samples, using the petal
width cut-off ≤ 0.75 cm. After the first split, we can see that the left child node is
already pure and only contains samples from the Iris-setosa class (Gini Impurity =
0). The further splits on the right are then used to separate the samples from the
Iris-versicolor and Iris-virginica classes.
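The Gini values reported in the tree figure can be checked by hand from the class counts at each node. The following is a minimal sketch; the helper name gini_impurity is our own and not part of scikit-learn, and the class counts are the ones read off the figure (the counts for the pure left child, [35, 0, 0], are inferred from the text):

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity 1 - sum_i p_i^2 for a node with the given class counts."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()          # class proportions at this node
    return 1.0 - np.sum(p ** 2)

root = gini_impurity([35, 35, 35])     # root node before any split
left = gini_impurity([35, 0, 0])       # pure Iris-setosa child
right = gini_impurity([0, 35, 35])     # mixed Versicolor/Virginica child

print(round(root, 4), round(left, 4), round(right, 4))
```

The root evaluates to 2/3 (shown as 0.6667 in the figure), the pure child to 0, and the mixed child to 0.5.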




Looking at this tree, and the decision region plot of the tree, we see that the decision
tree does a very good job of separating the flower classes. Unfortunately, scikit-learn
currently does not implement functionality to manually post-prune a decision tree.
However, we could go back to our previous code example, change the max_depth of
our decision tree to 3, and compare it to our current model, but we leave this as an
exercise for the interested reader.
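One possible way to start on this comparison is sketched below. It assumes the same Iris setup as earlier in the chapter (petal length and petal width as features, a stratified 70/30 split with random_state=1); if your split differs, the numbers will differ too:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, [2, 3]], iris.target          # petal length, petal width
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

scores = {}
for depth in (3, 4):
    clf = DecisionTreeClassifier(criterion='gini',
                                 max_depth=depth,
                                 random_state=1)
    clf.fit(X_train, y_train)
    # store (training accuracy, test accuracy) for each depth
    scores[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(depth, scores[depth])
```

Comparing the training and test accuracy of both trees gives a first impression of whether the extra level of depth helps or merely fits the training data more closely.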


Combining multiple decision trees via random forests


Random forests have gained huge popularity in applications of machine learning 
during the last decade due to their good classification performance, scalability, and 
ease of use. Intuitively, a random forest can be considered as an ensemble of 
decision trees. The idea behind a random forest is to average multiple (deep) 
decision trees that individually suffer from high variance, to build a more robust 
model that has a better generalization performance and is less susceptible to
overfitting. The random forest algorithm can be summarized in four simple steps: 


1. Draw a random bootstrap sample of size n (randomly choose n samples from
   the training set with replacement).
2. Grow a decision tree from the bootstrap sample. At each node:
   a. Randomly select d features without replacement.
   b. Split the node using the feature that provides the best split according to the
      objective function, for instance, maximizing the information gain.
3. Repeat the steps 1-2 k times.
4. Aggregate the prediction by each tree to assign the class label by majority vote.
   Majority voting will be discussed in more detail in Chapter 7, Combining
   Different Models for Ensemble Learning.
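The four steps above can be sketched in a few lines on top of scikit-learn's DecisionTreeClassifier. This is only an illustration of the procedure, not how scikit-learn implements its random forest; the helper names fit_forest and predict_forest are hypothetical, and the use of the full Iris dataset and of max_features='sqrt' for d are our own assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, k=25, d='sqrt', seed=1):
    """Steps 1-3: grow k trees, each on its own bootstrap sample."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    trees = []
    for _ in range(k):                            # step 3: repeat k times
        idx = rng.randint(0, n, size=n)           # step 1: n draws with replacement
        tree = DecisionTreeClassifier(
            max_features=d,                       # step 2a: d random features per split
            random_state=rng.randint(2**31 - 1))
        trees.append(tree.fit(X[idx], y[idx]))    # step 2b: best split among those features
    return trees

def predict_forest(trees, X):
    """Step 4: majority vote over the per-tree predictions."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # shape (k, n_samples)
    # np.bincount assumes non-negative integer class labels
    return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = load_iris(return_X_y=True)
trees = fit_forest(X, y)
y_pred = predict_forest(trees, X)
print('training accuracy: %.3f' % np.mean(y_pred == y))
```

Because every tree sees a different bootstrap sample and a different random feature subset at each split, the individual trees disagree on some samples, and the vote smooths out those disagreements.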


We should note one slight modification in step 2 when we are training the individual 
decision trees: instead of evaluating all features to determine the best split at each 
node, we only consider a random subset of those. 


Note 


In case you are not familiar with the terms sampling with and without replacement,
let's walk through a simple thought experiment. Let's assume we are playing a lottery
game where we randomly draw numbers from an urn. We start with an urn that holds
five unique numbers, 0, 1, 2, 3, and 4, and we draw exactly one number each turn. In
the first round, the chance of drawing a particular number from the urn would be 1/5.
Now, in sampling without replacement, we do not put the number back into the urn
after each turn. Consequently, the probability of drawing a particular number from
the set of remaining numbers in the next round depends on the previous round. For
example, if we have a remaining set of numbers 0, 1, 2, and 4, the chance of drawing
number 0 would become 1/4 in the next turn.


However, in random sampling with replacement, we always return the drawn
number to the urn so that the probability of drawing a particular number at each
turn does not change; we can draw the same number more than once. In other words,
in sampling with replacement, the samples (numbers) are independent and have a
covariance of zero. For example, the results from five rounds of drawing random
numbers could look like this:


• Random sampling without replacement: 2, 1, 3, 4, 0
• Random sampling with replacement: 1, 3, 3, 4, 1
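The urn experiment can be reproduced with NumPy's random number generator. This is a small sketch; the seed and variable names are arbitrary choices of ours, and the drawn numbers depend on the seed:

```python
import numpy as np

rng = np.random.RandomState(1)
urn = np.arange(5)                                   # the numbers 0, 1, 2, 3, 4

# without replacement: every number is drawn exactly once
without = rng.choice(urn, size=5, replace=False)
# with replacement: the same number may be drawn more than once
with_repl = rng.choice(urn, size=5, replace=True)

print('without replacement:', without)
print('with replacement:   ', with_repl)
```

Drawing without replacement always yields a permutation of the urn, whereas drawing with replacement may repeat some numbers and omit others.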


Although random forests don't offer the same level of interpretability as decision
trees, a big advantage of random forests is that we don't have to worry so much about
choosing good hyperparameter values. We typically don't need to prune the random
forest since the ensemble model is quite robust to noise from the individual decision
trees. The only parameter that we really need to care about in practice is the number
of trees k (step 3) that we choose for the random forest. Typically, the larger the
number of trees, the better the performance of the random forest classifier at the
expense of an increased computational cost.


Although it is less common in practice, other hyperparameters of the random forest
classifier that can be optimized (using techniques we will discuss in Chapter 6,
Learning Best Practices for Model Evaluation and Hyperparameter Tuning) are the
size n of the bootstrap sample (step 1) and the number of features d that are
randomly chosen for each split (step 2.a), respectively. Via the sample size n of the
bootstrap sample, we control the bias-variance tradeoff of the random forest.


Decreasing the size of the bootstrap sample increases the diversity among the
individual trees, since the probability that a particular training sample is included in
the bootstrap sample is lower. Thus, shrinking the size of the bootstrap samples may
increase the randomness of the random forest, and it can help to reduce the effect of
overfitting. However, smaller bootstrap samples typically result in a lower overall
performance of the random forest: the gap between training and test performance is
small, but the test performance itself is low. Conversely, increasing the size of
the bootstrap sample may increase the degree of overfitting. Because the bootstrap
samples, and consequently the individual decision trees, become more similar to
each other, they learn to fit the original training dataset more closely.
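To make the effect of the bootstrap sample size concrete, we can compute the probability that a particular training sample appears at least once in a bootstrap sample. With n draws with replacement from N training samples, this probability is 1 - (1 - 1/N)^n. The helper name inclusion_probability and the draw counts below are illustrative choices of ours; N = 105 is the training set size used in this chapter:

```python
def inclusion_probability(n_draws, n_train):
    """P(a given training sample appears at least once in the bootstrap sample)."""
    return 1.0 - (1.0 - 1.0 / n_train) ** n_draws

n_train = 105   # size of the Iris training set used in this chapter
for n_draws in (26, 52, 105):
    print(n_draws, round(inclusion_probability(n_draws, n_train), 3))
```

With n equal to N, each sample is included with a probability of roughly 63 percent; smaller bootstrap samples lower this probability, which is what makes the individual trees more diverse.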


In most implementations, including the RandomForestClassifier implementation in
scikit-learn, the size of the bootstrap sample is chosen to be equal to the number of
samples in the original training set, which usually provides a good bias-variance
tradeoff. For the number of features d at each split, we want to choose a value that is
smaller than the total number of features in the training set. A reasonable default that
is used in scikit-learn and other implementations is $d = \sqrt{m}$, where m is the
number of features in the training set.


Conveniently, we don't have to construct the random forest classifier from individual 
decision trees by ourselves because there is already an implementation in scikit-learn 
that we can use: 


>>> from sklearn.ensemble import RandomForestClassifier
>>> forest = RandomForestClassifier(criterion='gini',
...                                 n_estimators=25,
...                                 random_state=1,
...                                 n_jobs=2)
>>> forest.fit(X_train, y_train)
>>> plot_decision_regions(X_combined, y_combined,
...                       classifier=forest, test_idx=range(105, 150))
>>> plt.xlabel('petal length')
>>> plt.ylabel('petal width')
>>> plt.legend(loc='upper left')
>>> plt.show()


After executing the preceding code, we should see the decision regions formed by 
the ensemble of trees in the random forest, as shown in the following figure: 


Using the preceding code, we trained a random forest from 25 decision trees via the
n_estimators parameter and used Gini impurity as the impurity measure to
split the nodes. Although we are growing a very small random forest from a very
small training dataset, we used the n_jobs parameter for demonstration purposes,
which allows us to parallelize the model training using multiple cores of our
computer (here two cores).


K-nearest neighbors – a lazy learning algorithm


The last supervised learning algorithm that we want to discuss in this chapter is the
k-nearest neighbor (KNN) classifier, which is particularly interesting because it is
fundamentally different from the learning algorithms that we have discussed so far.


KNN is a typical example of a lazy learner. It is called lazy not because of its
apparent simplicity, but because it doesn't learn a discriminative function from the
training data but memorizes the training dataset instead.


Note 


Parametric versus nonparametric models 


Machine learning algorithms can be grouped into parametric and nonparametric 
models. Using parametric models, we estimate parameters from the training dataset 
to learn a function that can classify new data points without requiring the original 
training dataset anymore. Typical examples of parametric models are the perceptron, 
logistic regression, and the linear SVM. In contrast, nonparametric models can't be 
characterized by a fixed set of parameters, and the number of parameters grows with 
the training data. Two examples of nonparametric models that we have seen so far
are the decision tree classifier/random forest and the kernel SVM. 


KNN belongs to a subcategory of nonparametric models that is described as
instance-based learning. Models based on instance-based learning are characterized
by memorizing the training dataset, and lazy learning is a special case of
instance-based learning that is associated with no (zero) cost during the learning
process.


The KNN algorithm itself is fairly straightforward and can be summarized by the 
following steps: 


1. Choose the number k of neighbors and a distance metric.
2. Find the k nearest neighbors of the sample that we want to classify.
3. Assign the class label by majority vote. 
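The three steps translate almost directly into NumPy. The following is a minimal sketch using the Euclidean distance; the function name knn_predict and the toy data are made up for illustration, and the class labels are assumed to be non-negative integers:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one sample by majority vote among its k nearest neighbors."""
    # Step 1: k is an argument; the distance metric here is Euclidean
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote (np.bincount needs non-negative integer labels)
    return np.bincount(y_train[nearest]).argmax()

# toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([2.9, 3.0]), k=3)
print(pred_a, pred_b)
```

Note that there is no training step at all: the "model" is simply the stored arrays X_train and y_train, and all of the work happens at prediction time.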


The following figure illustrates how a new data point (?) is assigned the triangle
class label based on majority voting among its five nearest neighbors.


[Figure: a new data point (?) surrounded by its five nearest neighbors; the
majority vote predicts the triangle class.]


Based on the chosen distance metric, the KNN algorithm finds the k samples in the
training dataset that are closest (most similar) to the point that we want to classify.
The class label of the new data point is then determined by a majority vote among its
k nearest neighbors.


The main advantage of such a memory-based approach is that the classifier
immediately adapts as we collect new training data. However, the downside is that
the computational complexity for classifying new samples grows linearly with the
number of samples in the training dataset in the worst-case scenario, unless the
dataset has very few dimensions (features) and the algorithm has been implemented
using efficient data structures such as KD-trees (An Algorithm for Finding Best
Matches in Logarithmic Expected Time, J. H. Friedman, J. L. Bentley, and R. A.
Finkel, ACM Transactions on Mathematical Software (TOMS), 3(3): 209-226, 1977).
Furthermore, we can't discard training samples since no training step is involved.
Thus, storage space can become a challenge if we are working with large datasets.
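As a brief illustration of such a data structure, scikit-learn ships a KDTree class in sklearn.neighbors. The sketch below builds the index once on random two-dimensional data (an arbitrary choice of ours) and compares the neighbors it returns with a brute-force search over all samples:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(1000, 2)                   # 1,000 samples with 2 features

tree = KDTree(X)                        # build the index once
dist, ind = tree.query(X[:1], k=5)      # 5 nearest neighbors of the first sample

# brute-force check: sort the distances from the query point to all samples
brute = np.argsort(np.linalg.norm(X - X[0], axis=1))[:5]
print(ind[0], brute)
```

Both approaches find the same neighbors (the closest one being the query point itself); the tree simply avoids comparing the query against every stored sample.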


By executing the following code, we will now implement a KNN model in
scikit-learn using a Euclidean distance metric:


>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=5, p=2,
...                            metric='minkowski')
>>> knn.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std, y_combined,
...                       classifier=knn, test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()


By specifying five neighbors in the KNN model for this dataset, we obtain a
relatively smooth decision boundary, as shown in the following figure:


Note


In the case of a tie, the scikit-learn implementation of the KNN algorithm will prefer 
the neighbors with a closer distance to the sample. If the neighbors have similar 
distances, the algorithm will choose the class label that comes first in the training 
dataset. 


The right choice of k is crucial to find a good balance between overfitting and 
underfitting. We also have to make sure that we choose a distance metric that is 
appropriate for the features in the dataset. Often, a simple Euclidean distance 
measure is used for real-value samples, for example, the flowers in our Iris dataset,
which have features measured in centimeters. However, if we are using a Euclidean
distance measure, it is also important to standardize the data so that each feature
contributes equally to the distance. The minkowski distance that we used in the
previous code is just a generalization of the Euclidean and Manhattan distance,
which can be written as follows:


$$d\left(x^{(i)}, x^{(j)}\right) = \sqrt[p]{\sum_{k} \left| x_k^{(i)} - x_k^{(j)} \right|^{p}}$$





It becomes the Euclidean distance if we set the parameter p=2 or the Manhattan
distance at p=1. Many other distance metrics are available in scikit-learn and can be
provided to the metric parameter. They are listed at
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html.

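The relationship between the Minkowski, Manhattan, and Euclidean distances is easy to verify numerically. In the sketch below, the function name minkowski and the two example vectors are our own illustrative choices:

```python
import numpy as np

def minkowski(x_i, x_j, p):
    """Minkowski distance between two feature vectors."""
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)   # |3| + |4|
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

For these two points, p=1 gives 7 and p=2 gives 5, which are the Manhattan and Euclidean distances, respectively.
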
Note 


The curse of dimensionality 


It is important to mention that KNN is very susceptible to overfitting due to the
curse of dimensionality. The curse of dimensionality describes the phenomenon
where the feature space becomes increasingly sparse for an increasing number of
dimensions of a fixed-size training dataset. Intuitively, we can think of even the
closest neighbors being too far away in a high-dimensional space to give a good
estimate.
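A small simulation can make this sparsity tangible. The sketch below draws a fixed number of uniformly distributed points in the unit hypercube and measures the average distance from each point to its nearest neighbor as the number of dimensions grows; the sample size, the dimensions, and the uniform distribution are assumptions we chose for illustration:

```python
import numpy as np

rng = np.random.RandomState(1)
n_samples = 200                                   # fixed-size training set

mean_nn = {}
for n_dims in (2, 10, 100):
    X = rng.rand(n_samples, n_dims)               # uniform points in the unit hypercube
    # pairwise Euclidean distances via broadcasting
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)                # ignore each point's distance to itself
    mean_nn[n_dims] = dist.min(axis=1).mean()     # average nearest-neighbor distance
    print(n_dims, round(mean_nn[n_dims], 3))
```

With the number of samples held constant, the average nearest-neighbor distance grows with the number of dimensions, so the "nearest" neighbors become less and less local.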


We have discussed the concept of regularization in the section about logistic 
regression as one way to avoid overfitting. However, in models where regularization 
is not applicable, such as decision trees and KNN, we can use feature selection and 
dimensionality reduction techniques to help us avoid the curse of dimensionality. 
This will be discussed in more detail in the next chapter. 


Summary


In this chapter, you learned about many different machine learning algorithms that 
are used to tackle linear and nonlinear problems. We have seen that decision trees 
are particularly attractive if we care about interpretability. Logistic regression is not 
only a useful model for online learning via stochastic gradient descent, but also 
allows us to predict the probability of a particular event. Although support vector 
machines are powerful linear models that can be extended to nonlinear problems via 
the kernel trick, they have many parameters that have to be tuned in order to make
good predictions. In contrast, ensemble methods such as random forests don't require 
much parameter tuning and don't overfit as easily as decision trees, which makes 
them attractive models for many practical problem domains. The KNN classifier 
offers an alternative approach to classification via lazy learning that allows us to 
make predictions without any model training, but with a more computationally 
expensive prediction step. 


However, even more important than the choice of an appropriate learning algorithm 
is the available data in our training dataset. No algorithm will be able to make good
predictions without informative and discriminatory features. 


In the next chapter, we will discuss important topics regarding the preprocessing of 
data, feature selection, and dimensionality reduction, which we will need to build 
powerful machine learning models. Later in Chapter 6, Learning Best Practices for 
Model Evaluation and Hyperparameter Tuning, we will see how we can evaluate 
and compare the performance of our models and learn useful tricks to fine-tune the 
different algorithms. 


Chapter 4. Building Good Training Sets – Data Preprocessing


The quality of the data and the amount of useful information that it contains are key 
factors that determine how well a machine learning algorithm can learn. Therefore, it 
is absolutely critical that we make sure to examine and preprocess a dataset before 
we feed it to a learning algorithm. In this chapter, we will discuss the essential data 
preprocessing techniques that will help us build good machine learning models. 


The topics that we will cover in this chapter are as follows: 


• Removing and imputing missing values from the dataset
• Getting categorical data into shape for machine learning algorithms
• Selecting relevant features for the model construction




Dealing with missing data 


It is not uncommon in real-world applications for our samples to be missing one or 
more values for various reasons. There could have been an error in the data 
collection process, certain measurements are not applicable, or particular fields could 
have been simply left blank in a survey, for example. We typically see missing
values as the blank spaces in our data table or as placeholder strings such as NaN,
which stands for not a number, or NULL (a commonly used indicator of unknown
values in relational databases).


Unfortunately, most computational tools are unable to handle such missing values, or 
produce unpredictable results if we simply ignore them. Therefore, it is crucial that
we take care of those missing values before we proceed with further analyses. In this 
section, we will work through several practical techniques for dealing with missing 
values by removing entries from our dataset or imputing missing values from other 
samples and features. 


Identifying missing values in tabular data


But before we discuss several techniques for dealing with missing values, let's create 
a simple example data frame from a Comma-separated Values (CSV) file to get a 
better grasp of the problem: 


>>> import pandas as pd
>>> from io import StringIO
>>> csv_data = \
... '''A,B,C,D
... 1.0,2.0,3.0,4.0
... 5.0,6.0,,8.0
... 10.0,11.0,12.0,'''
>>> # If you are using Python 2.7, you need
>>> # to convert the string to unicode:
>>> # csv_data = unicode(csv_data)
>>> df = pd.read_csv(StringIO(csv_data))
>>> df
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


Using the preceding code, we read CSV-formatted data into a pandas DataFrame via
the read_csv function and noticed that the two missing cells were replaced by NaN.
The StringIO function in the preceding code example was simply used for the
purposes of illustration. It allows us to read the string assigned to csv_data into a
pandas DataFrame as if it was a regular CSV file on our hard drive.


For a larger DataFrame, it can be tedious to look for missing values manually; in this
case, we can use the isnull method to return a DataFrame with Boolean values that
indicate whether a cell contains a numeric value (False) or if data is missing (True).
Using the sum method, we can then return the number of missing values per column
as follows:


>>> df.isnull().sum()
A    0
B    0
C    1
D    1
dtype: int64


This way, we can count the number of missing values per column; in the following
subsections, we will take a look at different strategies for how to deal with this
missing data.


Note 


Although scikit-learn was developed for working with NumPy arrays, it can 
sometimes be more convenient to preprocess data using pandas' DataFrame. We can 
always access the underlying NumPy array of a DataFrame via the values attribute 
before we feed it into a scikit-learn estimator: 


>>> df.values
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])


Eliminating samples or features with missing values


One of the easiest ways to deal with missing data is to simply remove the 
corresponding features (columns) or samples (rows) from the dataset entirely; rows 
with missing values can be easily dropped via the dropna method: 


>>> df.dropna(axis=0)
     A    B    C    D
0  1.0  2.0  3.0  4.0


Similarly, we can drop columns that have at least one NaN in any row by setting the
axis argument to 1:


>>> df.dropna(axis=1)
      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0

The dropna method supports several additional parameters that can come in handy: 


# only drop rows where all columns are NaN
# (returns the whole array here since we don't
# have a row where all values are NaN)
>>> df.dropna(how='all')
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


# drop rows that have fewer than 4 real values
>>> df.dropna(thresh=4)
     A    B    C    D
0  1.0  2.0  3.0  4.0


# only drop rows where NaN appear in specific columns (here: 'C')
>>> df.dropna(subset=['C'])
      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN




Although the removal of missing data seems to be a convenient approach, it also 
comes with certain disadvantages; for example, we may end up removing too many 
samples, which will make a reliable analysis impossible. Or, if we remove too many
feature columns, we will run the risk of losing valuable information that our 
classifier needs to discriminate between classes. In the next section, we will thus 
look at one of the most commonly used alternatives for dealing with missing values: 
interpolation techniques. 


Imputing missing values


Often, the removal of samples or dropping of entire feature columns is simply not 
feasible, because we might lose too much valuable data. In this case, we can use 
different interpolation techniques to estimate the missing values from the other 
training samples in our dataset. One of the most common interpolation techniques is 
mean imputation, where we simply replace the missing value with the mean value 
of the entire feature column. A convenient way to achieve this is by using the 
Imputer class from scikit-learn, as shown in the following code:


>>> from sklearn.preprocessing import Imputer
>>> imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imr = imr.fit(df.values)
>>> imputed_data = imr.transform(df.values)
>>> imputed_data
array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])


Here, we replaced each NaN value with the corresponding mean, which is separately
calculated for each feature column. If we changed the axis=0 setting to axis=1, we'd
calculate the row means. Other options for the strategy parameter are median or
most_frequent, where the latter replaces the missing values with the most frequent
values. This is useful for imputing categorical feature values, for example, a feature
column that stores an encoding of color names, such as red, green, and blue, and we
will encounter examples of such data later in this chapter.
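Readers on a recent scikit-learn installation should be aware that the Imputer class was deprecated in version 0.20 and removed in version 0.22; its replacement is SimpleImputer in the sklearn.impute module, which always works column-wise and has no axis parameter. The following sketch shows the equivalent mean imputation under that assumption, with the example array typed in directly rather than taken from df:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# same values as the example DataFrame, entered by hand
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, np.nan, 8.0],
              [10.0, 11.0, 12.0, np.nan]])

# column-wise mean imputation (SimpleImputer has no axis argument)
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(X)
print(imputed_data)
```

The missing value in column C is replaced by 7.5 and the one in column D by 6.0, matching the output of the Imputer example.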


Understanding the scikit-learn estimator API


In the previous section, we used the Imputer class from scikit-learn to impute 
missing values in our dataset. The Imputer class belongs to the so-called 
transformer classes in scikit-learn, which are used for data transformation. The two 
essential methods of those estimators are fit and transform. The fit method is 
used to learn the parameters from the training data, and the transform method uses 
those parameters to transform the data. Any data array that is to be transformed 
needs to have the same number of features as the data array that was used to fit the 
model. The following figure illustrates how a transformer, fitted on the training data, 
is used to transform a training dataset as well as a new test dataset: 


[Figure: a transformer is fitted on the training data with est.fit(X_train);
est.transform(X_train) (2) then produces the transformed training data, and
est.transform(X_test) (3) produces the transformed test data.]
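The fit-then-transform workflow looks the same for every transformer. As a sketch, the example below uses StandardScaler as a stand-in for est and two small made-up arrays for X_train and X_test:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

est = StandardScaler()
est.fit(X_train)                         # learn mean and standard deviation per column
X_train_std = est.transform(X_train)     # transform the training data
X_test_std = est.transform(X_test)       # reuse the training statistics for the test data

print(est.mean_)                         # column means learned from X_train only
print(X_test_std)
```

The important detail is that fit is called only on the training data; the test data is transformed with the parameters learned there, never refitted.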


The classifiers that we used in Chapter 3, A Tour of Machine Learning Classifiers 
Using scikit-learn, belong to the so-called estimators in scikit-learn with an API that 
is conceptually very similar to the transformer class. Estimators have a predict 
method but can also have a transform method, as we will see later in this chapter. 
As you may recall, we also used the fit method to learn the parameters of a model
when we trained those estimators for classification. However, in supervised learning
tasks, we additionally provide the class labels for fitting the model, which can then
be used to make predictions about new data samples via the predict method, as
illustrated in the following figure:


[Figure: est.fit(X_train, y_train) is called with the training data and the
training labels; the fitted estimator then outputs the predicted labels.]
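The same pattern in code, sketched with a KNN classifier on the Iris dataset (the choice of classifier and the stratified 70/30 split are our own assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

est = KNeighborsClassifier(n_neighbors=5)
est.fit(X_train, y_train)          # supervised: features and class labels
y_pred = est.predict(X_test)       # predicted labels for unseen samples

accuracy = (y_pred == y_test).mean()
print(accuracy)
```

Compared with a transformer, the only differences are the additional y_train argument to fit and the use of predict instead of transform.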


Handling categorical data


So far, we have only been working with numerical values. However, it is not 
uncommon that real-world datasets contain one or more categorical feature columns. 
In this section, we will make use of simple yet effective examples to see how we 
deal with this type of data in numerical computing libraries. 




Nominal and ordinal features 


When we are talking about categorical data, we have to further distinguish between 
nominal and ordinal features. Ordinal features can be understood as categorical 
values that can be sorted or ordered. For example, t-shirt size would be an ordinal 
feature, because we can define an order XL > L > M. In contrast, nominal features 
don't imply any order and, to continue with the previous example, we could think of 
t-shirt color as a nominal feature since it typically doesn't make sense to say that, for 
example, red is larger than blue. 


Creating an example dataset 


Before we explore different techniques to handle such categorical data, let's create a 
new DataFrame to illustrate the problem: 


>>> import pandas as pd
>>> df = pd.DataFrame([
...            ['green', 'M', 10.1, 'class1'],
...            ['red', 'L', 13.5, 'class2'],
...            ['blue', 'XL', 15.3, 'class1']])
>>> df.columns = ['color', 'size', 'price', 'classlabel']
>>> df
   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1


As we can see in the preceding output, the newly created DataFrame contains a
nominal feature (color), an ordinal feature (size), and a numerical feature (price)
column. The class labels (assuming that we created a dataset for a supervised
learning task) are stored in the last column. The learning algorithms for classification
that we discuss in this book do not use ordinal information in class labels.


Mapping ordinal features


To make sure that the learning algorithm interprets the ordinal features correctly, we 
need to convert the categorical string values into integers. Unfortunately, there is no 
convenient function that can automatically derive the correct order of the labels of 
our size feature, so we have to define the mapping manually. In the following
simple example, let's assume that we know the numerical difference between
features, for example, XL = L + 1 = M + 2:


>>> size_mapping = {
...                 'XL': 3,
...                 'L': 2,
...                 'M': 1}
>>> df['size'] = df['size'].map(size_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1


If we want to transform the integer values back to the original string representation at
a later stage, we can simply define a reverse-mapping dictionary inv_size_mapping
= {v: k for k, v in size_mapping.items()} that can then be used via the
pandas map method on the transformed feature column, similar to the size_mapping
dictionary that we used previously. We can use it as follows:


>>> inv_size_mapping = {v: k for k, v in size_mapping.items()}
>>> df['size'].map(inv_size_mapping)
0     M
1     L
2    XL
Name: size, dtype: object


Encoding class labels


Many machine learning libraries require that class labels are encoded as integer 
values. Although most estimators for classification in scikit-learn convert class labels 
to integers internally, it is considered good practice to provide class labels as integer 
arrays to avoid technical glitches. To encode the class labels, we can use an approach 
similar to the mapping of ordinal features discussed previously. We need to 
remember that class labels are not ordinal, and it doesn't matter which integer 
number we assign to a particular string label. Thus, we can simply enumerate the 
class labels, starting at 0: 


>>> import numpy as np
>>> class_mapping = {label: idx for idx, label in
...                  enumerate(np.unique(df['classlabel']))}
>>> class_mapping
{'class1': 0, 'class2': 1}


Next, we can use the mapping dictionary to transform the class labels into integers: 


>>> df['classlabel'] = df['classlabel'].map(class_mapping)
>>> df
   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0


We can reverse the key-value pairs in the mapping dictionary as follows to map the 
converted class labels back to the original string representation: 


>>> inv_class_mapping = {v: k for k, v in class_mapping.items()}
>>> df['classlabel'] = df['classlabel'].map(inv_class_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1


Alternatively, there is a convenient LabelEncoder class directly implemented in
scikit-learn to achieve this:


>>> from sklearn.preprocessing import LabelEncoder
>>> class_le = LabelEncoder()
>>> y = class_le.fit_transform(df['classlabel'].values)
>>> y
array([0, 1, 0])


Note that the fit_transform method is just a shortcut for calling fit and transform
separately, and we can use the inverse_transform method to transform the integer
class labels back into their original string representation:


>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)
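As a small self-contained sketch of this round trip (the label array is re-created by hand here rather than taken from df), the fitted encoder also exposes the learned label order through its classes_ attribute, which can help when checking which integer was assigned to which label:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hand-made copy of the class label column used in this section.
labels = np.array(['class1', 'class2', 'class1'])

class_le = LabelEncoder()
class_le.fit(labels)              # learn the sorted set of unique labels
y = class_le.transform(labels)    # map each label to its integer index

# classes_[i] is the string label that was assigned the integer i.
restored = class_le.inverse_transform(y)
print(class_le.classes_, y, restored)
```

Calling fit and transform separately, as done here, is equivalent to the fit_transform shortcut.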
Performing one-hot encoding on nominal
features 


In the previous section, we used a simple dictionary-mapping approach to convert 
the ordinal size feature into integers. Since scikit-learn's estimators for classification 
treat class labels as categorical data that does not imply any order (nominal), we used 
the convenient LabelEncoder to encode the string labels into integers. It may appear 
that we could use a similar approach to transform the nominal color column of our 
dataset, as follows: 


>>> X = df[['color', 'size', 'price']].values
>>> color_le = LabelEncoder()
>>> X[:, 0] = color_le.fit_transform(X[:, 0])
>>> X
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)


After executing the preceding code, the first column of the NumPy array X now
holds the new color values, which are encoded as follows:

• blue = 0
• green = 1
• red = 2


If we stop at this point and feed the array to our classifier, we will make one of the
most common mistakes in dealing with categorical data. Can you spot the problem?
Although the color values don't come in any particular order, a learning algorithm
will now assume that green is larger than blue, and red is larger than green.
Although this assumption is incorrect, the algorithm could still produce useful
results. However, those results would not be optimal.


A common workaround for this problem is to use a technique called one-hot 
encoding. The idea behind this approach is to create a new dummy feature for each 
unique value in the nominal feature column. Here, we would convert the color 
feature into three new features: blue, green, and red. Binary values can then be used 
to indicate the particular color of a sample; for example, a blue sample can be 
encoded as blue=1, green=0, red=0. To perform this transformation, we can use the 
OneHotEncoder that is implemented in the sklearn.preprocessing module:
>>> from sklearn.preprocessing import OneHotEncoder
>>> ohe = OneHotEncoder(categorical_features=[0])
>>> ohe.fit_transform(X).toarray()
array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])


When we initialized the OneHotEncoder, we defined the column position of the
variable that we want to transform via the categorical_features parameter (note
that color is the first column in the feature matrix X). By default, the OneHotEncoder
returns a sparse matrix when we use the transform method, and we converted the
sparse matrix representation into a regular (dense) NumPy array for the purpose of
visualization via the toarray method. Sparse matrices are a more efficient way of
storing large datasets and are supported by many scikit-learn functions, which
is especially useful if an array contains a lot of zeros. To omit the toarray step, we
could alternatively initialize the encoder as OneHotEncoder(..., sparse=False) to
return a regular NumPy array.
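Note that the categorical_features parameter was removed from OneHotEncoder in later scikit-learn releases (0.22 and newer). A minimal sketch of an equivalent transformation, assuming such a newer release, uses ColumnTransformer to select the color column. The DataFrame is re-created here for self-containment, and sparse_threshold=0 is passed only to obtain a dense array for display:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Re-created example data; size is already mapped to integers.
df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': [1, 2, 3],
                   'price': [10.1, 13.5, 15.3]})

# One-hot encode only 'color'; pass 'size' and 'price' through unchanged.
ct = ColumnTransformer([('onehot', OneHotEncoder(), ['color'])],
                       remainder='passthrough',
                       sparse_threshold=0)  # always return a dense array
X_ohe = np.asarray(ct.fit_transform(df), dtype=float)
print(X_ohe)
```

The encoded columns come first (blue, green, red in sorted order), followed by the columns that were passed through.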


An even more convenient way to create those dummy features via one-hot encoding
is to use the get_dummies method implemented in pandas. Applied to a DataFrame,
the get_dummies method will only convert string columns and leave all other
columns unchanged:


>>> pd.get_dummies(df[['price', 'color', 'size']])
   price  size  color_blue  color_green  color_red
0   10.1     1           0            1          0
1   13.5     2           0            0          1
2   15.3     3           1            0          0
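Depending on the installed pandas version, the dummy columns may be displayed as True/False rather than 1/0, since newer releases (pandas 2.0 and later) return Boolean columns by default. As a hedged sketch on a re-created copy of the example DataFrame, passing dtype=int requests integer indicators explicitly:

```python
import pandas as pd

# Re-created example data; size is already mapped to integers.
df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': [1, 2, 3],
                   'price': [10.1, 13.5, 15.3]})

# Only the string column 'color' is expanded; numeric columns are kept as-is.
dummies = pd.get_dummies(df[['price', 'color', 'size']], dtype=int)
print(dummies)
```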


When we are using one-hot encoded datasets, we have to keep in mind that one-hot encoding
introduces multicollinearity, which can be an issue for certain methods (for instance,
methods that require matrix inversion). If features are highly correlated, matrices are
computationally difficult to invert, which can lead to numerically unstable estimates.
To reduce the correlation among variables, we can simply remove one feature
column from the one-hot encoded array. Note that we do not lose any important
information by removing a feature column, though; for example, if we remove the
column color_blue, the feature information is still preserved since if we observe
color_green=0 and color_red=0, it implies that the observation must be blue.
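The redundancy can be made concrete with a short NumPy sketch. It assumes a model with an intercept, represented here by a column of ones, and six hypothetical samples (this design matrix is an illustration and not part of the original example). The three dummy columns always sum to one, so together with the intercept column they are linearly dependent, and dropping one dummy column restores full column rank:

```python
import numpy as np

# Hypothetical color codes for six samples (0 = blue, 1 = green, 2 = red).
codes = np.array([1, 2, 0, 0, 1, 2])
onehot = np.eye(3)[codes]                 # columns: blue, green, red
intercept = np.ones((len(codes), 1))

full = np.hstack([intercept, onehot])            # 4 columns
reduced = np.hstack([intercept, onehot[:, 1:]])  # blue column dropped, 3 columns

# Each row of dummies sums to 1, i.e. it reproduces the intercept column.
print(onehot.sum(axis=1))
print(np.linalg.matrix_rank(full), full.shape[1])
print(np.linalg.matrix_rank(reduced), reduced.shape[1])
```

A rank below the number of columns means that the matrix product of the transposed design matrix with itself cannot be inverted, which is the situation the paragraph above warns about.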


If we use the get_dummies function, we can drop the first column by passing a True
argument to the drop_first parameter, as shown in the following code example:


>>> pd.get_dummies(df[['price', 'color', 'size']],
...                drop_first=True)
   price  size  color_green  color_red
0   10.1     1            1          0
1   13.5     2            0          1
2   15.3     3            0          0


The OneHotEncoder does not have a parameter for column removal, but we can 
simply slice the one-hot encoded NumPy array as shown in the following code 
snippet: 


>>> ohe = OneHotEncoder(categorical_features=[0])
>>> ohe.fit_transform(X).toarray()[:, 1:]
array([[  1. ,   0. ,   1. ,  10.1],
       [  0. ,   1. ,   2. ,  13.5],
       [  0. ,   0. ,   3. ,  15.3]])




Partitioning a dataset into separate 
training and test sets 


We briefly introduced the concept of partitioning a dataset into separate datasets for 
training and testing in Chapter 1, Giving Computers the Ability to Learn from Data, 
and Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn. 
Remember that comparing predictions to true labels in the test set can be understood
as the unbiased performance evaluation of our model before we let it loose on the 
real world. In this section, we will prepare a new dataset, the Wine dataset. After we 
have preprocessed the dataset, we will explore different techniques for feature 
selection to reduce the dimensionality of a dataset. 


The Wine dataset is another open-source dataset that is available from the UCI 
machine learning repository (https://archive.ics.uci.edu/ml/datasets/Wine); it consists
of 178 wine samples with 13 features describing their different chemical properties. 


Note 


You can find a copy of the Wine dataset (and all other datasets used in this book) in 
the code bundle of this book, which you can use if you are working offline or the 
dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data 
is temporarily unavailable on the UCI server. For instance, to load the Wine dataset 
from a local directory, you can replace this line: 


df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                 'machine-learning-databases/wine/wine.data',
                 header=None)


Replace it with this:


df = pd.read_csv('your/local/path/to/wine.data',
                 header=None)


Using the pandas library, we will directly read in the open source Wine dataset from 
the UCI machine learning repository: 


>>> df_wine = pd.read_csv('https://archive.ics.uci.edu/'
...                       'ml/machine-learning-databases/'
...                       'wine/wine.data', header=None)
>>> df_wine.columns = ['Class label', 'Alcohol',
...                    'Malic acid', 'Ash',
...                    'Alcalinity of ash', 'Magnesium',
...                    'Total phenols', 'Flavanoids',
...                    'Nonflavanoid phenols',
...                    'Proanthocyanins',
...                    'Color intensity', 'Hue',
...                    'OD280/OD315 of diluted wines',
...                    'Proline']
>>> print('Class labels', np.unique(df_wine['Class label']))
Class labels [1 2 3]
>>> df_wine.head()


The 13 different features in the Wine dataset, describing the chemical properties of 
the 178 wine samples, are listed in the following table: 


[Table: output of df_wine.head(), showing the class label and the 13 feature columns of the first five wine samples]











The samples belong to one of three different classes, 1, 2, and 3, which refer to the 
three different types of grape grown in the same region in Italy but derived from
different wine cultivars, as described in the dataset summary 
(https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names). 





A convenient way to randomly partition this dataset into separate test and training
datasets is to use the train_test_split function from scikit-learn's
model_selection submodule:


>>> from sklearn.model_selection import train_test_split
>>> X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
>>> X_train, X_test, y_train, y_test =\
...     train_test_split(X, y,
...                      test_size=0.3,
...                      random_state=0,
...                      stratify=y)
First, we assigned the NumPy array representation of the feature columns 1-13 to the
variable X; we assigned the class labels from the first column to the variable y. Then,
we used the train_test_split function to randomly split X and y into separate
training and test datasets. By setting test_size=0.3, we assigned 30 percent of the
wine samples to X_test and y_test, and the remaining 70 percent of the samples
were assigned to X_train and y_train, respectively. Providing the class label array
y as an argument to stratify ensures that both training and test datasets have the
same class proportions as the original dataset.
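The effect of stratify can be checked directly by comparing class proportions. The following sketch assumes that the copy of the Wine data bundled with scikit-learn (load_wine, whose labels are 0, 1, 2 rather than 1, 2, 3) is an acceptable stand-in for the CSV file; the helper class_proportions is introduced only for this illustration:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: the bundled Wine copy replaces the CSV file used in the text.
X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

def class_proportions(labels):
    """Return the fraction of samples that fall into each class."""
    return np.bincount(labels) / len(labels)

print('all:  ', class_proportions(y))
print('train:', class_proportions(y_train))
print('test: ', class_proportions(y_test))
```

With stratification, the three rows should agree up to the rounding that is unavoidable when splitting a small number of samples.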


Note 


If we are dividing a dataset into training and test datasets, we have to keep in mind 
that we are withholding valuable information that the learning algorithm could 
benefit from. Thus, we don't want to allocate too much information to the test set. 
However, the smaller the test set, the more inaccurate the estimation of the 
generalization error. Dividing a dataset into training and test sets is all about
balancing this trade-off. In practice, the most commonly used splits are 60:40, 70:30, 
or 80:20, depending on the size of the initial dataset. However, for large datasets, 
90:10 or 99:1 splits into training and test subsets are also common and appropriate. 
Instead of discarding the allocated test data after model training and evaluation, it is 
a common practice to retrain a classifier on the entire dataset as it can improve the 
predictive performance of the model. While this approach is generally 
recommended, it could lead to worse generalization performance if the dataset is
small and the test set contains outliers, for example. Also, after refitting the model on 
the whole dataset, we don't have any independent data left to evaluate its 
performance. 
Bringing features onto the same scale


Feature scaling is a crucial step in our preprocessing pipeline that can easily be 
forgotten. Decision trees and random forests are two of the very few machine 
learning algorithms where we don't need to worry about feature scaling. Those 
algorithms are scale invariant. However, the majority of machine learning and 
optimization algorithms behave much better if features are on the same scale, as we 
have seen in Chapter 2, Training Simple Machine Learning Algorithms for 
Classification, when we implemented the gradient descent optimization algorithm. 


The importance of feature scaling can be illustrated by a simple example. Let's
assume that we have two features where one feature is measured on a scale from 1 to
10 and the second feature is measured on a scale from 1 to 100,000, respectively.
When we think of the squared error function in Adaline in Chapter 2, Training
Simple Machine Learning Algorithms for Classification, it is intuitive to say that the
algorithm will mostly be busy optimizing the weights according to the larger errors
in the second feature. Another example is the k-nearest neighbors (KNN) algorithm
with a Euclidean distance measure; the computed distances between samples will be
dominated by the second feature axis.
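A short numerical sketch of the KNN case, using two hypothetical samples (the values are made up for illustration): the samples lie at opposite ends of the first feature's range but differ by only 0.1 percent of the second feature's range, and yet the second feature accounts for almost the entire squared distance.

```python
import numpy as np

# Two hypothetical samples: feature 1 lies in [1, 10], feature 2 in [1, 100000].
a = np.array([1.0, 50000.0])
b = np.array([10.0, 50100.0])

squared_diffs = (a - b) ** 2              # per-feature squared differences
distance = np.sqrt(squared_diffs.sum())   # Euclidean distance between a and b
share = squared_diffs / squared_diffs.sum()

print(distance)
print(share)   # fraction of the squared distance contributed by each feature
```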


Now, there are two common approaches to bring different features onto the same
scale: normalization and standardization. Those terms are often used quite loosely
in different fields, and the meaning has to be derived from the context. Most often,
normalization refers to the rescaling of the features to a range of [0, 1], which is a
special case of min-max scaling. To normalize our data, we can simply apply the
min-max scaling to each feature column, where the new value $x_{norm}^{(i)}$ of a sample
$x^{(i)}$ can be calculated as follows:


$$x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}$$


Here, $x^{(i)}$ is a particular sample, $x_{min}$ is the smallest value in a feature column, and
$x_{max}$ the largest value.


The min-max scaling procedure is implemented in scikit-learn and can be used as 
follows: 


>>> from sklearn.preprocessing import MinMaxScaler
>>> mms = MinMaxScaler()
>>> X_train_norm = mms.fit_transform(X_train)
>>> X_test_norm = mms.transform(X_test)
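Because the scaler is fitted on the training data only, test values that lie outside the training range are mapped outside of [0, 1]. A small sketch with made-up numbers (the toy arrays below are hypothetical and not part of the Wine dataset) illustrates this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_toy = np.array([[1.0], [5.0], [9.0]])   # training min = 1, max = 9
X_test_toy = np.array([[0.0], [5.0], [11.0]])   # contains values outside [1, 9]

mms = MinMaxScaler()
train_norm = mms.fit_transform(X_train_toy)   # learns min and max from the training data
test_norm = mms.transform(X_test_toy)         # reuses the training min and max

print(train_norm.ravel())
print(test_norm.ravel())
```

The training column is mapped exactly onto [0, 1], whereas the test values 0 and 11 fall below 0 and above 1, respectively.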


Although normalization via min-max scaling is a commonly used technique that is
useful when we need values in a bounded interval, standardization can be more
practical for many machine learning algorithms, especially for optimization
algorithms such as gradient descent. The reason is that many linear models, such as
the logistic regression and SVM that we remember from Chapter 3, A Tour of
Machine Learning Classifiers Using scikit-learn, initialize the weights to 0 or small
random values close to 0. Using standardization, we center the feature columns at
mean 0 with standard deviation 1 so that the feature columns have the same
parameters as a standard normal distribution (zero mean and unit variance), which
makes it easier to learn the weights. Note that standardization does not change the
shape of the distribution. Furthermore,
standardization maintains useful information about outliers and makes the algorithm
less sensitive to them in contrast to min-max scaling, which scales the data to a
limited range of values.


The procedure for standardization can be expressed by the following equation:


$$x_{std}^{(i)} = \frac{x^{(i)} - \mu_x}{\sigma_x}$$


Here, $\mu_x$ is the sample mean of a particular feature column and $\sigma_x$ is the
corresponding standard deviation.


The following table illustrates the difference between the two commonly used 
feature scaling techniques, standardization and normalization, on a simple sample 
dataset consisting of numbers 0 to 5: 
Input    Standardized    Min-max normalized
0.0      -1.46385        0.0
1.0      -0.87831        0.2
2.0      -0.29277        0.4
3.0       0.29277        0.6
4.0       0.87831        0.8
5.0       1.46385        1.0





You can perform the standardization and normalization shown in the table manually 
by executing the following code examples: 


>>> ex = np.array([0, 1, 2, 3, 4, 5])
>>> print('standardized:', (ex - ex.mean()) / ex.std())
standardized: [-1.46385011 -0.87831007 -0.29277002  0.29277002
  0.87831007  1.46385011]
>>> print('normalized:', (ex - ex.min()) / (ex.max() - ex.min()))
normalized: [ 0.   0.2  0.4  0.6  0.8  1. ]


Similar to the MinMaxScaler class, scikit-learn also implements a class for 
standardization: 


>>> from sklearn.preprocessing import StandardScaler
>>> stdsc = StandardScaler()
>>> X_train_std = stdsc.fit_transform(X_train)
>>> X_test_std = stdsc.transform(X_test)


Again, it is also important to highlight that we fit the StandardScaler class only
once, on the training data, and use those parameters to transform the test set or any
new data point.
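To make the reuse of the training parameters explicit, the following sketch (on synthetic data, which is an assumption for illustration) reproduces the transform step by hand from the mean_ and scale_ attributes that StandardScaler stores after fitting:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train_toy = rng.normal(loc=10.0, scale=3.0, size=(50, 2))  # synthetic training data
X_test_toy = rng.normal(loc=10.0, scale=3.0, size=(20, 2))   # synthetic test data

stdsc = StandardScaler().fit(X_train_toy)   # parameters come from the training data only

# mean_ and scale_ hold the per-column training mean and standard deviation.
manual = (X_test_toy - stdsc.mean_) / stdsc.scale_
print(np.allclose(manual, stdsc.transform(X_test_toy)))

# The transformed test set is generally close to, but not exactly, centered at 0.
print(stdsc.transform(X_test_toy).mean(axis=0))
```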
Selecting meaningful features


If we notice that a model performs much better on a training dataset than on the test
dataset, this observation is a strong indicator of overfitting. As we discussed in
Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, overfitting
means the model fits the parameters too closely with regard to the particular
observations in the training dataset, but does not generalize well to new data, and we
say the model has a high variance. The reason for the overfitting is that our model is
too complex for the given training data. Common solutions to reduce the
generalization error are listed as follows:


• Collect more training data
• Introduce a penalty for complexity via regularization
• Choose a simpler model with fewer parameters
• Reduce the dimensionality of the data


Collecting more training data is often not applicable. In Chapter 6, Learning Best 
Practices for Model Evaluation and Hyperparameter Tuning, we will learn about a 
useful technique to check whether more training data is helpful at all. In the 
following sections, we will look at common ways to reduce overfitting by 
regularization and dimensionality reduction via feature selection, which leads to 
simpler models by requiring fewer parameters to be fitted to the data. 
L1 and L2 regularization as penalties against
model complexity
We recall from Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn,
that L2 regularization is one approach to reduce the complexity of a model
by penalizing large individual weights, where we defined the L2 norm of our weight
vector w as follows:


$$L2: \lVert \mathbf{w} \rVert_2^2 = \sum_{j=1}^{m} w_j^2$$


Another approach to reduce the model complexity is the related L1 regularization:


$$L1: \lVert \mathbf{w} \rVert_1 = \sum_{j=1}^{m} \lvert w_j \rvert$$


Here, we simply replaced the square of the weights by the sum of the absolute values 
of the weights. In contrast to L2 regularization, L1 regularization usually yields 
sparse feature vectors; most feature weights will be zero. Sparsity can be useful in 
practice if we have a high-dimensional dataset with many features that are irrelevant, 
especially cases where we have more irrelevant dimensions than samples. In this 
sense, L1 regularization can be understood as a technique for feature selection. 
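Both penalty terms are straightforward to compute. As a small sketch on a hypothetical weight vector (the values are chosen only to make the arithmetic easy to follow):

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])   # hypothetical weight vector

l2_penalty = np.sum(w ** 2)      # squared L2 norm: 9 + 16 + 0
l1_penalty = np.sum(np.abs(w))   # L1 norm: 3 + 4 + 0

# The same quantities expressed via np.linalg.norm.
assert np.isclose(l2_penalty, np.linalg.norm(w, ord=2) ** 2)
assert np.isclose(l1_penalty, np.linalg.norm(w, ord=1))
print(l2_penalty, l1_penalty)
```

Note that a weight that is already zero contributes nothing to either penalty.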
A geometric interpretation of L2 regularization


As mentioned in the previous section, L2 regularization adds a penalty term to the
cost function that effectively results in less extreme weight values compared to a
model trained with an unregularized cost function. To better understand how L1
regularization encourages sparsity, let's take a step back and take a look at a
geometric interpretation of regularization. Let us plot the contours of a convex cost
function for two weight coefficients $w_1$ and $w_2$. Here, we will consider the Sum of
Squared Errors (SSE) cost function that we used for Adaline in Chapter 2, Training
Simple Machine Learning Algorithms for Classification, since it is spherical and
easier to draw than the cost function of logistic regression; however, the same
concepts apply to the latter. Remember that our goal is to find the combination of
weight coefficients that minimize the cost function for the training data, as shown in
the following figure (the point in the center of the ellipses):


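[Figure: contours of the convex cost function plotted over the weight coefficients w1 and w2; the cost minimum lies at the center of the ellipses]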


Now, we can think of regularization as adding a penalty term to the cost function to 
encourage smaller weights; or in other words, we penalize large weights. 


Thus, by increasing the regularization strength via the regularization parameter $\lambda$,
we shrink the weights towards zero and decrease the dependence of our model on the
training data. Let us illustrate this concept in the following figure for the L2 penalty
term:


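[Figure: cost contours over w1 and w2 together with the L2 penalty drawn as a shaded ball centered at the origin; annotations: Minimize cost, Minimize penalty, Minimize cost + penalty]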





The quadratic L2 regularization term is represented by the shaded ball. Here, our
weight coefficients cannot exceed our regularization budget; that is, the combination of the
weight coefficients cannot fall outside the shaded area. On the other hand, we still
want to minimize the cost function. Under the penalty constraint, our best effort is to
choose the point where the L2 ball intersects with the contours of the unpenalized
cost function. The larger the value of the regularization parameter $\lambda$ gets, the faster
the penalized cost grows, which leads to a narrower L2 ball. For example, if we
increase the regularization parameter towards infinity, the weight coefficients will
become effectively zero, denoted by the center of the L2 ball. To summarize the
main message of the example, our goal is to minimize the sum of the unpenalized
cost plus the penalty term, which can be understood as adding bias and preferring a
simpler model to reduce the variance in the absence of sufficient training data to fit
the model.
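A minimal numerical sketch of this shrinkage follows. It assumes a linear model with the SSE cost and uses the closed-form L2-regularized (ridge) solution $\mathbf{w} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$ instead of gradient descent; the data are synthetic, and the helper ridge_weights is introduced only for this illustration:

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.normal(size=(30, 2))                  # synthetic features
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=30)

def ridge_weights(X, y, lam):
    """Minimize SSE + lam * ||w||_2^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0, 1e6]
norms = [np.linalg.norm(ridge_weights(X, y, lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(lam, norm)   # the weight norm shrinks as lam grows
```

For very large values of the regularization parameter, the weight vector approaches the center of the L2 ball, that is, the origin.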
Sparse solutions with L1 regularization


Now, let us discuss L1 regularization and sparsity. The main concept behind L1
regularization is similar to what we have discussed in the previous section. However,
since the L1 penalty is the sum of the absolute weight coefficients (remember that
the L2 term is quadratic), we can represent it as a diamond-shape budget, as shown
in the following figure:


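[Figure: cost contours over w1 and w2 together with the L1 penalty drawn as a diamond centered at the origin; annotations: Minimize cost, Minimize penalty, Minimize cost + penalty (w1 = 0)]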





In the preceding figure, we can see that the contour of the cost function touches the
L1 diamond at $w_1 = 0$. Since the contours of an L1 regularized system are sharp, it
is more likely that the optimum (that is, the intersection between the ellipses of the
cost function and the boundary of the L1 diamond) is located on the axes, which
encourages sparsity.


Note 


The mathematical details of why L1 regularization can lead to sparse solutions are
beyond the scope of this book. If you are interested, an excellent explanation of L2
versus L1 regularization can be found in Section 3.4 of The Elements of Statistical
Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer
Science+Business Media, 2009.
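For intuition only, and under a strong simplifying assumption that is not made in the text (orthonormal feature columns, so that each weight can be treated independently), the two penalties act on an unpenalized solution in a simple closed form: the L2 penalty rescales every weight, whereas the L1 penalty shifts each weight toward zero and clips it at zero (soft-thresholding). The sketch below applies both rules to a hypothetical weight vector; the function names and the cost conventions stated in the comments are assumptions of this illustration:

```python
import numpy as np

def l2_shrink(w, lam):
    # Minimizer of 0.5 * (v - w)**2 + 0.5 * lam * v**2 for each weight v.
    return w / (1.0 + lam)

def l1_shrink(w, lam):
    # Minimizer of 0.5 * (v - w)**2 + lam * |v| for each weight v
    # (soft-thresholding).
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w_unpenalized = np.array([3.0, -0.5, 1.2])   # hypothetical weights
print(l2_shrink(w_unpenalized, lam=1.0))     # all weights shrink, none becomes zero
print(l1_shrink(w_unpenalized, lam=1.0))     # the small weight becomes exactly zero
```

This mirrors the geometric picture: the corners of the L1 diamond make exact zeros attainable, while the round L2 ball only scales the weights down.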


For regularized models in scikit-learn that support L1 regularization, we can simply
set the penalty parameter to 'l1' to obtain a sparse solution:


>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression(penalty='l1')


Applied to the standardized Wine data, the L1 regularized logistic regression would 
yield the following sparse solution: 


>>> lr = LogisticRegression(penalty='l1', C=1.0)
>>> lr.fit(X_train_std, y_train)
>>> print('Training accuracy:', lr.score(X_train_std, y_train))
Training accuracy: 1.0
>>> print('Test accuracy:', lr.score(X_test_std, y_test))
Test accuracy: 1.0


Both training and test accuracies (both 100 percent) indicate that our model does a
perfect job on both datasets. When we access the intercept terms via the
lr.intercept_ attribute, we can see that the array returns three values:


>>> lr.intercept_
array([-1.26338637, -1.21582634, -2.3701035 ])


Since we fit the LogisticRegression object on a multiclass dataset, it uses the One-
versus-Rest (OvR) approach by default, where the first intercept belongs to the
model that fits class 1 versus class 2 and 3, the second value is the intercept of the
model that fits class 2 versus class 1 and 3, and the third value is the intercept of the
model that fits class 3 versus class 1 and 2:


>>> lr.coef_
array([[ 1.24559337,  0.18041967,  0.74328894, -1.16046277,  0.        ,
         0.        ,  1.1678711 ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.54941931,  2.51017406],
       [-1.53680415, -0.38795309, -0.99494046,  0.36508729, -0.0596352 ,
         0.        ,  0.66833149,  0.        ,  0.        , -1.9346134 ,
         1.23297955,  0.        , -2.23135027],
       [ 0.13579227,  0.16837686,  0.35723831,  0.        ,  0.        ,
         0.        , -2.43809275,  0.        ,  0.        ,  1.56391408,
        -0.81933286, -0.49187817,  0.        ]])
The weight array that we accessed via the lr.coef_ attribute contains three rows of
weight coefficients, one weight vector for each class. Each row consists of 13
weights where each weight is multiplied by the respective feature in the
13-dimensional Wine dataset to calculate the net input:


$$z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{j=0}^{m} x_j w_j = \mathbf{w}^T \mathbf{x}$$


Note 


In scikit-learn, $w_0$ corresponds to the intercept_ and $w_j$ with $j > 0$ correspond to the
values in coef_.
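The relationship between coef_, intercept_, and the net input can be verified numerically. The sketch below rests on several assumptions that differ from the listing in the text: it uses the copy of the Wine data bundled with scikit-learn (labels 0, 1, 2), fits on all samples to stay short, requests the One-versus-Rest scheme explicitly through OneVsRestClassifier, and passes solver='liblinear' because the default solver in recent scikit-learn releases does not support the L1 penalty:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Wine copy replaces the CSV file used in the text.
X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# One binary L1-penalized logistic regression model per class.
ovr = OneVsRestClassifier(
    LogisticRegression(penalty='l1', C=1.0, solver='liblinear'))
ovr.fit(X_std, y)

for binary_lr in ovr.estimators_:
    w = binary_lr.coef_[0]          # 13 weights, w_1 ... w_m
    w0 = binary_lr.intercept_[0]    # intercept, w_0
    z = X_std @ w + w0              # net input for every sample
    assert np.allclose(z, binary_lr.decision_function(X_std))
    print(w.shape, int(np.sum(w == 0)), 'zero weights')
```

Each of the three binary models holds one weight vector and one intercept, which corresponds to one row of the weight array and one entry of the intercept array discussed above.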


As a result of L1 regularization, which serves as a method for feature selection, we
just trained a model that is robust to the potentially irrelevant features in this dataset.


Strictly speaking, the weight vectors from the previous example are not necessarily
sparse, though, because they contain more non-zero than zero entries. However, we
could enforce sparsity (more zero entries) by further increasing the regularization
strength; that is, choosing lower values for the C parameter.
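As a hedged sketch of this effect, the following code counts the non-zero weights of a single one-versus-rest model for a few values of C. It assumes the bundled copy of the Wine data (labels 0, 1, 2), treats one class against the rest as a binary problem, and passes solver='liblinear' explicitly since that solver supports the L1 penalty:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)          # bundled Wine copy (assumption)
X_std = StandardScaler().fit_transform(X)
y_binary = (y == 1).astype(int)            # one class versus the rest

nonzero_counts = []
for C in [1e-4, 1e-2, 1.0, 1e2]:
    lr = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    lr.fit(X_std, y_binary)
    nonzero_counts.append(int(np.count_nonzero(lr.coef_)))
    print(C, nonzero_counts[-1])   # number of non-zero weights for this C
```

Smaller values of C correspond to stronger regularization, so the count of non-zero weights is expected to drop as C decreases.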


In the last example on regularization in this chapter, we will vary the regularization 
strength and plot the regularization path—the weight coefficients of the different 
features for different regularization strengths: 


>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = plt.subplot(111)
>>> colors = ['blue', 'green', 'red', 'cyan',
...           'magenta', 'yellow', 'black',
...           'pink', 'lightgreen', 'lightblue',
...           'gray', 'indigo', 'orange']
>>> weights, params = [], []
>>> for c in np.arange(-4., 6.):
...     lr = LogisticRegression(penalty='l1',
...                             C=10.**c,
...                             random_state=0)
...     lr.fit(X_train_std, y_train)
...     weights.append(lr.coef_[1])
...     params.append(10**c)
>>> weights = np.array(weights)
>>> for column, color in zip(range(weights.shape[1]), colors):
...     plt.plot(params, weights[:, column],
...              label=df_wine.columns[column + 1],
...              color=color)
>>> plt.axhline(0, color='black', linestyle='--', linewidth=3)
>>> plt.xlim([10**(-5), 10**5])
>>> plt.ylabel('weight coefficient')
>>> plt.xlabel('C')
>>> plt.xscale('log')
>>> plt.legend(loc='upper left')
>>> ax.legend(loc='upper center',
...           bbox_to_anchor=(1.38, 1.03),
...           ncol=1, fancybox=True)
>>> plt.show()


The resulting plot provides us with further insights into the behavior of L1
regularization. As we can see, all feature weights will be zero if we penalize the
model with a strong regularization parameter (C < 0.1); C is the inverse of the
regularization parameter $\lambda$.


[Figure: regularization path of the L1-penalized model; x-axis: C (log scale), y-axis: weight coefficient; one line per feature: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, Proline]





Sequential feature selection algorithms 


An alternative way to reduce the complexity of the model and avoid overfitting is
dimensionality reduction via feature selection, which is especially useful for
unregularized models. There are two main categories of dimensionality reduction
techniques: feature selection and feature extraction. Via feature selection, we
select a subset of the original features, whereas in feature extraction, we derive
information from the feature set to construct a new feature subspace.


In this section, we will take a look at a classic family of feature selection algorithms. 
In the next chapter, Chapter 5, Compressing Data via Dimensionality Reduction, we 
will learn about different feature extraction techniques to compress a dataset onto a 
lower-dimensional feature subspace. 


Sequential feature selection algorithms are a family of greedy search algorithms that
are used to reduce an initial d-dimensional feature space to a k-dimensional feature
subspace where k < d. The motivation behind feature selection algorithms is to
automatically select a subset of features that are most relevant to the problem, to
improve computational efficiency or reduce the generalization error of the model by
removing irrelevant features or noise, which can be useful for algorithms that don't
support regularization.


A classic sequential feature selection algorithm is Sequential Backward Selection 
(SBS), which aims to reduce the dimensionality of the initial feature subspace with a 
minimum decay in performance of the classifier to improve upon computational 
efficiency. In certain cases, SBS can even improve the predictive power of the model 
if a model suffers from overfitting. 


Note 


Greedy algorithms make locally optimal choices at each stage of a combinatorial 
search problem and generally yield a suboptimal solution to the problem, in contrast 
to exhaustive search algorithms, which evaluate all possible combinations and are 
guaranteed to find the optimal solution. However, in practice, an exhaustive search is 
often computationally not feasible, whereas greedy algorithms allow for a less 
complex, computationally more efficient solution. 
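To give a rough sense of the difference in cost, the following sketch counts the candidate feature subsets that each strategy evaluates. It assumes d = 13 features, as in the Wine dataset, an arbitrarily chosen target of k = 5 features, and a greedy backward search that removes exactly one feature per stage:

```python
from math import comb

d = 13   # number of features, as in the Wine dataset
k = 5    # desired number of features (arbitrary choice for illustration)

# Exhaustive search: every subset of exactly k features is evaluated.
exhaustive = comb(d, k)

# Greedy backward search: at a subset size of `dim`, each of the `dim`
# candidate subsets of size dim - 1 is evaluated once, until k features remain.
greedy = sum(dim for dim in range(k + 1, d + 1))

print(exhaustive, greedy)
```

The gap widens quickly as the number of features grows, which is why greedy procedures are often the only practical option.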


The idea behind the SBS algorithm is quite simple: SBS sequentially removes
features from the full feature subset until the new feature subspace contains the
desired number of features. In order to determine which feature is to be removed at
each stage, we need to define the criterion function J that we want to maximize. The
criterion calculated by the criterion function can simply be the difference in
performance of the classifier after and before the removal of a particular feature.
Then, the feature to be removed at each stage can simply be defined as the feature
that maximizes this criterion; or in more intuitive terms, at each stage we eliminate
the feature that causes the least performance loss after removal. Based on the
preceding definition of SBS, we can outline the algorithm in four simple steps:


1. Initialize the algorithm with $k = d$, where $d$ is the dimensionality of the full
feature space $X_d$.

2. Determine the feature $x^-$ that maximizes the criterion:
$x^- = \arg\max J(X_k - x)$, where $x \in X_k$.

3. Remove the feature $x^-$ from the feature set: $X_{k-1} := X_k - x^-$; $k := k - 1$.

4. Terminate if $k$ equals the number of desired features; otherwise, go to step 2.


Note 


You can find a detailed evaluation of several sequential feature algorithms in 
Comparative Study of Techniques for Large-Scale Feature Selection, F. Ferri, P. 
Pudil, M. Hatef, and J. Kittler, pages 403-413, 1994. 


Unfortunately, the SBS algorithm has not been implemented in scikit-learn yet. But 
since it is so simple, let us go ahead and implement it in Python from scratch: 


from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class SBS():
    def __init__(self, estimator, k_features,
                 scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # Hold out part of the data to score the candidate subsets.
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=self.test_size,
                             random_state=self.random_state)

        # Start with the full feature set.
        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train,
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores = []
            subsets = []

            # Evaluate every subset that omits exactly one feature.
            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train,
                                         X_test, y_test, p)
                scores.append(score)
                subsets.append(p)

            # Keep the best-scoring subset for the next stage.
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1

            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]

        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test,
                    indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score
In the preceding implementation, we defined the k_features parameter to specify
the desired number of features we want to return. By default, we use
accuracy_score from scikit-learn to evaluate the performance of a model (an
estimator for classification) on the feature subsets. Inside the while loop of the fit
method, the feature subsets created by the itertools.combinations function are
evaluated and reduced until the feature subset has the desired dimensionality. In each
iteration, the accuracy score of the best subset is collected in a list, self.scores_,
based on the internally created test dataset X_test. We will use those scores later to
evaluate the results. The column indices of the final feature subset are assigned to
self.indices_, which we can use via the transform method to return a new data
array with the selected feature columns. Note that, instead of calculating the criterion
explicitly inside the fit method, we simply removed the feature that is not contained
in the best performing feature subset.
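To make the candidate-generation step more concrete, the following minimal sketch shows what itertools.combinations produces for a hypothetical four-feature index tuple. The variable names are illustrative only and are not part of the SBS class.

```python
from itertools import combinations

# Hypothetical column indices of a four-feature dataset.
indices = (0, 1, 2, 3)

# All subsets with one feature fewer; each omits exactly one column.
candidates = list(combinations(indices, r=len(indices) - 1))

# The single feature that each candidate subset leaves out.
left_out = [(set(indices) - set(c)).pop() for c in candidates]

for subset, removed in zip(candidates, left_out):
    print(subset, "drops feature", removed)
```

With d features remaining, each pass of the while loop therefore fits and scores d candidate models, so shrinking from d features down to k costs on the order of d + (d - 1) + ... + (k + 1) model fits.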


Now, let us see our SBS implementation in action using the KNN classifier from 
scikit-learn: 


>>> import matplotlib.pyplot as plt
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> sbs = SBS(knn, k_features=1)
>>> sbs.fit(X_train_std, y_train)


Although our SBS implementation already splits the dataset into a training and a test
dataset inside the fit method, we still fed only the training dataset X_train_std to
the algorithm. The SBS fit method then splits this data further into a new training
subset and a test (validation) subset, which is why this test set is also called the
validation dataset. This approach is necessary to prevent our original test set from
becoming part of the training data.


Remember that our SBS algorithm collects the scores of the best feature subset at 
each stage, so let us move on to the more exciting part of our implementation and 
plot the classification accuracy of the KNN classifier that was calculated on the 
validation dataset. The code is as follows: 


>>> k_feat = [len(k) for k in sbs.subsets_]
>>> plt.plot(k_feat, sbs.scores_, marker='o')
>>> plt.ylim([0.7, 1.02])
>>> plt.ylabel('Accuracy')
>>> plt.xlabel('Number of features')
>>> plt.grid()
>>> plt.show()


As we can see in the following figure, the accuracy of the KNN classifier improved
on the validation dataset as we reduced the number of features, which is likely due to
a decrease in the curse of dimensionality that we discussed in the context of the
KNN algorithm in Chapter 3, A Tour of Machine Learning Classifiers Using
scikit-learn. Also, we can see in the following plot that the classifier achieved
100 percent accuracy for k={3, 7, 8, 9, 10, 11, 12}:

To satisfy our own curiosity, let's see what the smallest feature subset (k=3) that
yielded such a good performance on the validation dataset looks like:


>>> k3 = list(sbs.subsets_[10])
>>> print(df_wine.columns[1:][k3])
Index(['Alcohol', 'Malic acid', 'OD280/OD315 of diluted wines'],
      dtype='object')


Using the preceding code, we obtained the column indices of the three-feature subset
from index 10 of the sbs.subsets_ attribute (its eleventh entry) and returned the
corresponding feature names from the column index of the pandas Wine DataFrame.
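The choice of index 10 follows from how the subsets are stored. As a small sketch (assuming, as in the run above, 13 features and k_features=1), the first entry of the list holds all features and each later entry holds one feature fewer:

```python
n_features = 13   # the Wine dataset has 13 features
k_features = 1    # SBS was run down to a single feature

# subsets_ starts with the full feature set and loses one feature per step,
# so entry i holds n_features - i column indices.
subset_sizes = [n_features - i for i in range(n_features - k_features + 1)]

# Position of the three-feature subset within that list.
index_of_three = subset_sizes.index(3)
print(index_of_three)
```

The same reasoning gives index 13 - k for any subset size k in this run.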


Next let's evaluate the performance of the KNN classifier on the original test set: 


>>> knn.fit(X_train_std, y_train)
>>> print('Training accuracy:', knn.score(X_train_std, y_train))
Training accuracy: 0.967741935484
>>> print('Test accuracy:', knn.score(X_test_std, y_test))
Test accuracy: 0.962962962963


In the preceding code section, we used the complete feature set and obtained 
approximately 97 percent accuracy on the training dataset and approximately 96 
percent accuracy on the test dataset, which indicates that our model already generalizes well
to new data. Now, let us use the selected three-feature subset and see how well KNN 
performs: 


>>> knn.fit(X_train_std[:, k3], y_train)
>>> print('Training accuracy:',
...       knn.score(X_train_std[:, k3], y_train))
Training accuracy: 0.951612903226
>>> print('Test accuracy:',
...       knn.score(X_test_std[:, k3], y_test))
Test accuracy: 0.925925925926


Using fewer than a quarter of the original features in the Wine dataset, the prediction
accuracy on the test set declined only slightly. This may indicate that those three
features provide almost as much discriminatory information as the original dataset.
However, we also have to keep in mind that the Wine dataset is a small dataset, which
is very susceptible to randomness, that is, to the way we split the dataset into training
and test subsets, and to how we split the training dataset further into a training and
validation subset.
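A quick back-of-the-envelope check shows how few samples are behind these percentages. The sketch below assumes a test set of 54 samples (30 percent of the 178 Wine samples, the counterpart of the 124 training samples) and converts the two reported test accuracies back into counts:

```python
n_test = 54  # assumed test set size: 30 percent of the 178 Wine samples

acc_full = 0.962962962963   # reported test accuracy with all 13 features
acc_three = 0.925925925926  # reported test accuracy with the 3-feature subset

# Convert the accuracies back into numbers of correctly classified samples.
correct_full = round(acc_full * n_test)
correct_three = round(acc_three * n_test)

print(correct_full, correct_three, correct_full - correct_three)
```

Under this assumption, the gap between the two models amounts to only two test samples, which is well within what a different random split could produce.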


While we did not increase the performance of the KNN model by reducing the 
number of features, we shrank the size of the dataset, which can be useful in real- 
world applications that may involve expensive data collection steps. Also, by 
substantially reducing the number of features, we obtain simpler models, which are 
easier to interpret. 


Note 


Feature selection algorithms in scikit-learn

There are many more feature selection algorithms available via scikit-learn. Those
include recursive backward elimination based on feature weights, tree-based
methods to select features by importance, and univariate statistical tests. A
comprehensive discussion of the different feature selection methods is beyond the
scope of this book, but a good summary with illustrative examples can be found at
http://scikit-learn.org/stable/modules/feature_selection.html. Furthermore, I
implemented several different flavors of sequential feature selection, related to the
simple SBS that we implemented previously. You can find these implementations in
the Python package mlxtend at
http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector
Assessing feature importance with random forests


In previous sections, you learned how to use L1 regularization to zero out irrelevant
features via logistic regression, and how to use the SBS algorithm for feature selection
and apply it to a KNN algorithm. Another useful approach to select relevant features
from a dataset is to use a random forest, an ensemble technique that we introduced
in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn. Using a
random forest, we can measure the feature importance as the averaged impurity
decrease computed from all decision trees in the forest, without making any
assumptions about whether our data is linearly separable or not. Conveniently, the
random forest implementation in scikit-learn already collects the feature importance
values for us so that we can access them via the feature_importances_ attribute
after fitting a RandomForestClassifier. By executing the following code, we will
now train a forest of 500 trees on the Wine dataset and rank the 13 features by
their respective importance measures. Remember from our discussion in Chapter 3,
A Tour of Machine Learning Classifiers Using scikit-learn, that we don't need to use
standardized or normalized features in tree-based models:


>>> from sklearn.ensemble import RandomForestClassifier
>>> feat_labels = df_wine.columns[1:]
>>> forest = RandomForestClassifier(n_estimators=500,
...                                 random_state=1)
>>> forest.fit(X_train, y_train)
>>> importances = forest.feature_importances_
>>> indices = np.argsort(importances)[::-1]
>>> for f in range(X_train.shape[1]):
...     print("%2d) %-*s %f" % (f + 1, 30,
...                             feat_labels[indices[f]],
...                             importances[indices[f]]))
>>> plt.title('Feature Importance')
>>> plt.bar(range(X_train.shape[1]),
...         importances[indices],
...         align='center')
>>> # label the bars in the same sorted order in which they are drawn
>>> plt.xticks(range(X_train.shape[1]),
...            feat_labels[indices], rotation=90)
>>> plt.xlim([-1, X_train.shape[1]])
>>> plt.tight_layout()
>>> plt.show()
 1) Proline                        0.185453
 2) Flavanoids                     0.174751
 3) Color intensity                0.143920
 4) OD280/OD315 of diluted wines   0.136162
 5) Alcohol                        0.118529
 6) Hue                            0.058739
 7) Total phenols                  0.050872
 8) Magnesium                      0.031357
 9) Malic acid                     0.025648
10) Proanthocyanins                0.025570
11) Alcalinity of ash              0.022366
12) Nonflavanoid phenols           0.013354
13) Ash                            0.013279


After executing the code, we created a plot that ranks the different features in the 
Wine dataset, by their relative importance; note that the feature importance values 
are normalized so that they sum up to 1.0: 


[Figure: bar chart titled "Feature Importance", showing the Wine features ranked by their relative importance]
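The normalization can be checked by hand. The following sketch simply sums the 13 importance values reported for this run (rounded to six decimals, so the total is only expected to match 1.0 up to rounding):

```python
import math

# Importance values reported for the 13 Wine features, in ranked order.
importances = [0.185453, 0.174751, 0.143920, 0.136162, 0.118529,
               0.058739, 0.050872, 0.031357, 0.025648, 0.025570,
               0.022366, 0.013354, 0.013279]

total = sum(importances)
print(round(total, 6))

# Allow a small tolerance because the printed values are rounded.
print(math.isclose(total, 1.0, abs_tol=1e-5))
```

Because the values are relative shares, a single importance is best read in comparison with the others rather than as an absolute quantity.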
We can conclude that the proline and flavonoid levels, the color intensity, the
OD280/OD315 diffraction, and the alcohol concentration of wine are the most
discriminative features in the dataset based on the average impurity decrease in the
500 decision trees. Interestingly, two of the top-ranked features in the plot are also in
the three-feature subset selection from the SBS algorithm that we implemented in the
previous section (alcohol concentration and OD280/OD315 of diluted wines).
However, as far as interpretability is concerned, the random forest technique comes
with an important gotcha that is worth mentioning. If two or more features are highly
correlated, one feature may be ranked very highly while the information of the other
feature(s) may not be fully captured. On the other hand, we don't need to be
concerned about this problem if we are merely interested in the predictive
performance of a model rather than the interpretation of feature importance values.


To conclude this section about feature importance values and random forests, it is
worth mentioning that scikit-learn also implements a SelectFromModel object that
selects features based on a user-specified threshold after model fitting. This is
useful if we want to use the RandomForestClassifier as a feature selector and
intermediate step in a scikit-learn Pipeline object, which allows us to connect
different preprocessing steps with an estimator, as we will see in Chapter 6, Learning
Best Practices for Model Evaluation and Hyperparameter Tuning. For example, we
could set the threshold to 0.1 to reduce the dataset to the five most important
features using the following code:


>>> from sklearn.feature_selection import SelectFromModel
>>> sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
>>> X_selected = sfm.transform(X_train)
>>> print('Number of features that meet this threshold criterion:',
...       X_selected.shape[1])
Number of features that meet this threshold criterion: 5
>>> for f in range(X_selected.shape[1]):
...     print("%2d) %-*s %f" % (f + 1, 30,
...                             feat_labels[indices[f]],
...                             importances[indices[f]]))
 1) Proline                        0.185453
 2) Flavanoids                     0.174751
 3) Color intensity                0.143920
 4) OD280/OD315 of diluted wines   0.136162
 5) Alcohol                        0.118529
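The effect of the threshold can be reproduced without scikit-learn. The sketch below is a plain-Python illustration, not the SelectFromModel implementation; it keeps every feature whose reported importance is at least 0.1, using the six top-ranked values from this run:

```python
# (name, importance) pairs for the six top-ranked features of this run.
ranked = [("Proline", 0.185453),
          ("Flavanoids", 0.174751),
          ("Color intensity", 0.143920),
          ("OD280/OD315 of diluted wines", 0.136162),
          ("Alcohol", 0.118529),
          ("Hue", 0.058739)]

threshold = 0.1

# Keep the features whose importance reaches the threshold.
selected = [name for name, importance in ranked if importance >= threshold]

print(len(selected))
print(selected)
```

Hue, the sixth-ranked feature, falls clearly below 0.1, so small changes to the threshold around that value would not alter which five features are kept.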
Summary


We started this chapter by looking at useful techniques to make sure that we handle 
missing data correctly. Before we feed data to a machine learning algorithm, we also 
have to make sure that we encode categorical variables correctly, and we have seen 
how we can map ordinal and nominal feature values to integer representations. 


Moreover, we briefly discussed L1 regularization, which can help us to avoid 
overfitting by reducing the complexity of a model. As an alternative approach to 
removing irrelevant features, we used a sequential feature selection algorithm to 
select meaningful features from a dataset. 


In the next chapter, you will learn about yet another useful approach to 
dimensionality reduction: feature extraction. It allows us to compress features onto a 
lower-dimensional subspace, rather than removing features entirely as in feature 
selection. 
Chapter 5. Compressing Data via Dimensionality Reduction


In Chapter 4, Building Good Training Sets — Data Preprocessing, you learned about 
the different approaches for reducing the dimensionality of a dataset using different 
feature selection techniques. An alternative approach to feature selection for 
dimensionality reduction is feature extraction. In this chapter, you will learn about 
three fundamental techniques that will help us to summarize the information content 
of a dataset by transforming it onto a new feature subspace of lower dimensionality 
than the original one. Data compression is an important topic in machine learning,
and it helps us to store and analyze the increasing amounts of data that are produced 
and collected in the modern age of technology. 


In this chapter, we will cover the following topics: 


• Principal Component Analysis (PCA) for unsupervised data compression

• Linear Discriminant Analysis (LDA) as a supervised dimensionality reduction
technique for maximizing class separability

• Nonlinear dimensionality reduction via Kernel Principal Component
Analysis (KPCA)
Unsupervised dimensionality reduction via principal component analysis


Similar to feature selection, we can use different feature extraction techniques to 
reduce the number of features in a dataset. The difference between feature selection 
and feature extraction is that while we maintain the original features when we use
feature selection algorithms, such as sequential backward selection, we use feature 
extraction to transform or project the data onto a new feature space. In the context of 
dimensionality reduction, feature extraction can be understood as an approach to data 
compression with the goal of maintaining most of the relevant information. In 
practice, feature extraction is not only used to improve storage space or the 
computational efficiency of the learning algorithm, but can also improve the 
predictive performance by reducing the curse of dimensionality—especially if we are 
working with non-regularized models. 
The main steps behind principal component analysis


In this section, we will discuss PCA, an unsupervised linear transformation 
technique that is widely used across different fields, most prominently for feature 
extraction and dimensionality reduction. Other popular applications of PCA include 
exploratory data analyses and de-noising of signals in stock market trading, and the 
analysis of genome data and gene expression levels in the field of bioinformatics. 


PCA helps us to identify patterns in data based on the correlation between features. 
In a nutshell, PCA aims to find the directions of maximum variance in
high-dimensional data and projects it onto a new subspace with equal or fewer dimensions
than the original one. The orthogonal axes (principal components) of the new 
subspace can be interpreted as the directions of maximum variance given the 
constraint that the new feature axes are orthogonal to each other, as illustrated in the 
following figure:

[Figure: the original feature axes $x_1$ and $x_2$ together with the principal components PC1 and PC2]

In the preceding figure, $x_1$ and $x_2$ are the original feature axes, and PC1 and PC2
are the principal components.


If we use PCA for dimensionality reduction, we construct a $d \times k$-dimensional
transformation matrix $W$ that allows us to map a sample vector $x$ onto a new
$k$-dimensional feature subspace that has fewer dimensions than the original
$d$-dimensional feature space:

$x = [x_1, x_2, \dots, x_d], \quad x \in \mathbb{R}^d$

$\downarrow xW, \quad W \in \mathbb{R}^{d \times k}$

$z = [z_1, z_2, \dots, z_k], \quad z \in \mathbb{R}^k$


As a result of transforming the original $d$-dimensional data onto this new
$k$-dimensional subspace (typically $k \ll d$), the first principal component will have
the largest possible variance, and all subsequent principal components will have the
largest variance given the constraint that these components are uncorrelated
(orthogonal) to the other principal components; even if the input features are
correlated, the resulting principal components will be mutually orthogonal
(uncorrelated). Note that the PCA directions are highly sensitive to data scaling, and
we need to standardize the features prior to PCA if the features were measured on
different scales and we want to assign equal importance to all features.


Before looking at the PCA algorithm for dimensionality reduction in more detail, 
let's summarize the approach in a few simple steps: 


1. Standardize the d-dimensional dataset. 

2. Construct the covariance matrix. 

3. Decompose the covariance matrix into its eigenvectors and eigenvalues. 

4. Sort the eigenvalues by decreasing order to rank the corresponding 
eigenvectors. 

5. Select $k$ eigenvectors which correspond to the $k$ largest eigenvalues, where $k$ is
the dimensionality of the new feature subspace ($k \le d$).

6. Construct a projection matrix $W$ from the "top" $k$ eigenvectors.

7. Transform the $d$-dimensional input dataset $X$ using the projection matrix $W$ to
obtain the new $k$-dimensional feature subspace.
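As a compact preview, the seven steps can be sketched in a few lines of NumPy. This is only an illustration on hypothetical random toy data (the array X, the random seed, and k=2 are assumptions made for this example), and it uses numpy.linalg.eigh because the covariance matrix is symmetric:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 100 samples, 3 features on very different scales.
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 0.1])
X[:, 1] += 5.0 * X[:, 0]  # make two of the features correlated

# 1. Standardize each feature to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (np.cov expects one variable per row).
cov_mat = np.cov(X_std.T)

# 3. Eigendecomposition of the symmetric covariance matrix.
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

# 4. Sort the eigenpairs by decreasing eigenvalue.
order = np.argsort(eigen_vals)[::-1]
eigen_vals, eigen_vecs = eigen_vals[order], eigen_vecs[:, order]

# 5. and 6. Use the top k eigenvectors as the columns of W.
k = 2
W = eigen_vecs[:, :k]

# 7. Project the data onto the k-dimensional subspace.
X_pca = X_std.dot(W)
print(X_pca.shape)
```

In this sketch, the sample variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues can be read as the amount of variance captured by each component.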


In the following sections, we will perform a PCA step by step, using Python as a 
learning exercise. Then, we will see how to perform a PCA more conveniently using 
scikit-learn. 
Extracting the principal components step by step


In this subsection, we will tackle the first four steps of a PCA: 


1. Standardizing the data. 

2. Constructing the covariance matrix. 

3. Obtaining the eigenvalues and eigenvectors of the covariance matrix. 
4. Sorting the eigenvalues by decreasing order to rank the eigenvectors. 


First, we will start by loading the Wine dataset that we have been working with in 
Chapter 4, Building Good Training Sets — Data Preprocessing: 

>>> import pandas as pd
>>> df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
...                       'machine-learning-databases/wine/wine.data',
...                       header=None)


Note 


You can find a copy of the Wine dataset (and all other datasets used in this book) in
the code bundle of this book, which you can use if you are working offline or the
UCI server at
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
is temporarily unavailable. For instance, to load the Wine dataset from a local
directory, you can replace the following line:

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

Replace it with this:

df_wine = pd.read_csv('your/local/path/to/wine.data', header=None)


Next, we will process the Wine data into separate training and test sets—using 70 
percent and 30 percent of the data, respectively—and standardize it to unit variance: 


>>> from sklearn.model_selection import train_test_split
>>> X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
>>> X_train, X_test, y_train, y_test = \
...     train_test_split(X, y, test_size=0.3,
...                      stratify=y,
...                      random_state=0)
>>> # standardize the features
>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> X_train_std = sc.fit_transform(X_train)
>>> X_test_std = sc.transform(X_test)


After completing the mandatory preprocessing by executing the preceding code, let's 
advance to the second step: constructing the covariance matrix. The symmetric 


$d \times d$-dimensional covariance matrix, where $d$ is the number of dimensions in the
dataset, stores the pairwise covariances between the different features. For example,
the covariance between two features $x_j$ and $x_k$ on the population level can be
calculated via the following equation:

$\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} \left( x_j^{(i)} - \mu_j \right) \left( x_k^{(i)} - \mu_k \right)$

Here, $\mu_j$ and $\mu_k$ are the sample means of features $j$ and $k$, respectively. Note that
the sample means are zero if we standardized the dataset. A positive covariance 
between two features indicates that the features increase or decrease together, 
whereas a negative covariance indicates that the features vary in opposite directions. 
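For example, the covariance matrix of three features can then be written as follows
(note that $\Sigma$ stands for the Greek uppercase letter sigma, which is not to be
confused with the sum symbol):

$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{bmatrix}$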

The eigenvectors of the covariance matrix represent the principal components (the
directions of maximum variance), whereas the corresponding eigenvalues will define
their magnitude. In the case of the Wine dataset, we would obtain 13 eigenvectors
and eigenvalues from the 13 x 13-dimensional covariance matrix.


Now, for our third step, let's obtain the eigenpairs of the covariance matrix. As we
remember from our introductory linear algebra classes, an eigenvector $v$ satisfies the
following condition:

$\Sigma v = \lambda v$

Here, $\lambda$ is a scalar: the eigenvalue. Since the manual computation of eigenvectors
and eigenvalues is a somewhat tedious and elaborate task, we will use the
linalg.eig function from NumPy to obtain the eigenpairs of the Wine covariance
matrix:


>>> import numpy as np
>>> cov_mat = np.cov(X_train_std.T)
>>> eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
>>> print('\nEigenvalues \n%s' % eigen_vals)

Eigenvalues
[ 4.84274532  2.41602459  1.54845825  0.96120438  0.84166161  0.6620634
  0.51828472  0.34650377  0.3131368   0.10754642  0.21357215  0.15362835
  0.1808613 ]


Using the numpy.cov function, we computed the covariance matrix of the
standardized training dataset. Using the linalg.eig function, we performed the
eigendecomposition, which yielded a vector (eigen_vals) consisting of 13
eigenvalues and the corresponding eigenvectors stored as columns in a
13 x 13-dimensional matrix (eigen_vecs).


Note 


The numpy.linalg.eig function was designed to operate on both symmetric and
non-symmetric square matrices. However, you may find that it returns complex
eigenvalues in certain cases.

A related function, numpy.linalg.eigh, has been implemented to decompose
Hermitian matrices, which is a numerically more stable approach to working with
symmetric matrices such as the covariance matrix; numpy.linalg.eigh always
returns real eigenvalues.
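As a brief illustration, the sketch below applies numpy.linalg.eigh to a small symmetric matrix that stands in for a covariance matrix (the matrix A is a made-up example) and verifies the eigenvector condition for each returned pair:

```python
import numpy as np

# A small symmetric matrix standing in for a covariance matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigen_vals, eigen_vecs = np.linalg.eigh(A)

# eigh returns real eigenvalues, sorted in ascending order.
print(eigen_vals)

# Each column of eigen_vecs satisfies A v = lambda v.
checks = [np.allclose(A.dot(v), lam * v)
          for lam, v in zip(eigen_vals, eigen_vecs.T)]
print(checks)
```

Because eigh returns the eigenvalues in ascending order, the result has to be reversed when the largest eigenvalues are wanted first, as is the case for PCA.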
Total and explained variance


Since we want to reduce the dimensionality of our dataset by compressing it onto a
new feature subspace, we only select the subset of the eigenvectors (principal
components) that contains most of the information (variance). The eigenvalues
define the magnitude of the eigenvectors, so we have to sort the eigenvalues by
decreasing magnitude; we are interested in the top $k$ eigenvectors based on the values
of their corresponding eigenvalues. But before we collect those $k$ most informative
eigenvectors, let us plot the variance explained ratios of the eigenvalues. The
variance explained ratio of an eigenvalue $\lambda_j$ is simply the fraction of an
eigenvalue $\lambda_j$ and the total sum of the eigenvalues:

$\frac{\lambda_j}{\sum_{i=1}^{d} \lambda_i}$


Using the NumPy cumsum function, we can then calculate the cumulative sum of 
explained variances, which we will then plot via Matplotlib's step function: 


>>> tot = sum(eigen_vals)
>>> var_exp = [(i / tot) for i in
...            sorted(eigen_vals, reverse=True)]
>>> cum_var_exp = np.cumsum(var_exp)
>>> import matplotlib.pyplot as plt
>>> plt.bar(range(1, 14), var_exp, alpha=0.5, align='center',
...         label='individual explained variance')
>>> plt.step(range(1, 14), cum_var_exp, where='mid',
...          label='cumulative explained variance')
>>> plt.ylabel('Explained variance ratio')
>>> plt.xlabel('Principal component index')
>>> plt.legend(loc='best')
>>> plt.show()
The resulting plot indicates that the first principal component alone accounts for
approximately 40 percent of the variance. Also, we can see that the first two
principal components combined explain almost 60 percent of the variance in the
dataset:

[Figure: bar and step plot of the explained variance ratio against the principal component index, showing the individual and the cumulative explained variance]

Although the explained variance plot reminds us of the feature importance values
that we computed in Chapter 4, Building Good Training Sets – Data Preprocessing,
via random forests, we should remind ourselves that PCA is an unsupervised
method, which means that information about the class labels is ignored. Whereas a
random forest uses the class membership information to compute the node
impurities, variance measures the spread of values along a feature axis.
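The ratios read off the plot can also be recomputed directly. The sketch below is a plain-Python version of the calculation, using the 13 eigenvalues reported for this run:

```python
from itertools import accumulate

# The 13 eigenvalues reported for the Wine covariance matrix in this run.
eigen_vals = [4.84274532, 2.41602459, 1.54845825, 0.96120438,
              0.84166161, 0.6620634, 0.51828472, 0.34650377,
              0.3131368, 0.10754642, 0.21357215, 0.15362835,
              0.1808613]

tot = sum(eigen_vals)

# Individual explained variance ratios, largest first.
var_exp = [val / tot for val in sorted(eigen_vals, reverse=True)]

# Running total of the ratios.
cum_var_exp = list(accumulate(var_exp))

print(round(var_exp[0], 3), round(cum_var_exp[1], 3))
```

With these values, the first component accounts for roughly 37 percent of the variance and the first two together for roughly 55 percent, which the text rounds to approximately 40 percent and almost 60 percent.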
Feature transformation


After we have successfully decomposed the covariance matrix into eigenpairs, let's 
now proceed with the last three steps to transform the Wine dataset onto the new 
principal component axes. The remaining steps we are going to tackle in this section 
are the following ones: 


• Select $k$ eigenvectors, which correspond to the $k$ largest eigenvalues, where $k$ is
the dimensionality of the new feature subspace ($k \le d$).

• Construct a projection matrix $W$ from the "top" $k$ eigenvectors.

• Transform the $d$-dimensional input dataset $X$ using the projection matrix $W$ to
obtain the new $k$-dimensional feature subspace.


Or, in less technical terms, we will sort the eigenpairs by descending order of the 
eigenvalues, construct a projection matrix from the selected eigenvectors, and use 
the projection matrix to transform the data onto the lower-dimensional subspace. 


We start by sorting the eigenpairs by decreasing order of the eigenvalues: 


>>> # Make a list of (eigenvalue, eigenvector) tuples
>>> eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
...                for i in range(len(eigen_vals))]
>>> # Sort the (eigenvalue, eigenvector) tuples from high to low
>>> eigen_pairs.sort(key=lambda k: k[0], reverse=True)


Next, we collect the two eigenvectors that correspond to the two largest eigenvalues, 
to capture about 60 percent of the variance in this dataset. Note that we only chose 
two eigenvectors for the purpose of illustration, since we are going to plot the data 
via a two-dimensional scatter plot later in this subsection. In practice, the number of 
principal components has to be determined by a trade-off between computational 
efficiency and the performance of the classifier: 


>>> w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
...                eigen_pairs[1][1][:, np.newaxis]))
>>> print('Matrix W:\n', w)
Matrix W:
[[-0.13724218  0.50303478]
 [ 0.24724326  0.16487119]
 [-0.02545159  0.24456476]
 [ 0.20694508 -0.11352904]
 [-0.15436582  0.28974518]
 [-0.39376952  0.05080104]
 [-0.41735106 -0.02287338]
 [ 0.30572896  0.09048885]
 [-0.30668347  0.00835233]
 [ 0.07554066  0.54977581]
 [-0.32613263 -0.20716433]
 [-0.36861022 -0.24902536]
 [-0.29669651  0.38022942]]

By executing the preceding code, we have created a 13 x 2-dimensional projection 
matrix W from the top two eigenvectors. 


Note 


Depending on which version of NumPy and LAPACK you are using, you may
obtain the matrix $W$ with its signs flipped. Please note that this is not an issue; if $v$ is
an eigenvector of a matrix $\Sigma$, we have:

$\Sigma v = \lambda v$

Here, $\lambda$ is our eigenvalue, and $-v$ is also an eigenvector that has the same
eigenvalue, since:

$\Sigma \cdot (-v) = -\Sigma v = -\lambda v = \lambda \cdot (-v)$
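The sign argument is easy to verify on a tiny example. The sketch below uses a hypothetical symmetric 2 x 2 matrix with a known eigenpair and a small helper function, matvec, that is defined here only for illustration:

```python
import math

# Hypothetical symmetric 2 x 2 matrix; v = (1, 1)/sqrt(2) has eigenvalue 4.
S = [[3.0, 1.0],
     [1.0, 3.0]]
lam = 4.0
v = [1 / math.sqrt(2), 1 / math.sqrt(2)]
neg_v = [-c for c in v]


def matvec(M, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * c for m, c in zip(row, x)) for row in M]


def is_eigenpair(M, value, vec):
    """Check M vec = value * vec component by component."""
    lhs = matvec(M, vec)
    rhs = [value * c for c in vec]
    return all(math.isclose(a, b) for a, b in zip(lhs, rhs))


print(is_eigenpair(S, lam, v), is_eigenpair(S, lam, neg_v))
```

Both checks succeed, so an eigensolver is free to return either orientation; only the direction of the axis is determined, not its sign.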


Using the projection matrix, we can now transform a sample $x$ (represented as a
1 x 13-dimensional row vector) onto the PCA subspace (the principal components one
and two), obtaining $x'$, now a two-dimensional sample vector consisting of two new
features:

$x' = xW$

>>> X_train_std[0].dot(w)
array([ 2.38299011,  0.45458499])


Similarly, we can transform the entire 124 x 13-dimensional training dataset onto the 
two principal components by calculating the matrix dot product: 


$X' = XW$

>>> X_train_pca = X_train_std.dot(w)


Lastly, let us visualize the transformed Wine training set, now stored as a
124 x 2-dimensional matrix, in a two-dimensional scatterplot:

>>> colors = ['r', 'b', 'g']
>>> markers = ['s', 'x', 'o']
>>> for l, c, m in zip(np.unique(y_train), colors, markers):
...     plt.scatter(X_train_pca[y_train==l, 0],
...                 X_train_pca[y_train==l, 1],
...                 c=c, label=l, marker=m)
>>> plt.xlabel('PC 1')
>>> plt.ylabel('PC 2')
>>> plt.legend(loc='lower left')
>>> plt.show()


As we can see in the resulting plot, the data is more spread along the x-axis (the first
principal component) than along the y-axis (the second principal component), which is
consistent with the explained variance ratio plot that we created in the previous
subsection. We can also see intuitively that a linear classifier will likely be able
to separate the classes well:

Although we encoded the class label information for the purpose of illustration in the
preceding scatter plot, we have to keep in mind that PCA is an unsupervised
technique that doesn't use any class label information.
Principal component analysis in scikit-learn


Although the verbose approach in the previous subsection helped us to follow the
inner workings of PCA, we will now discuss how to use the PCA class implemented
in scikit-learn. The PCA class is another one of scikit-learn's transformer classes,
where we first fit the model using the training data before we transform both the
training data and the test dataset using the same model parameters. Now, let's use the
PCA class from scikit-learn on the Wine training dataset, classify the transformed
samples via logistic regression, and visualize the decision regions via the
plot_decision_regions function that we defined in Chapter 2, Training Simple
Machine Learning Algorithms for Classification:


from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.6,
                    c=cmap(idx),
                    edgecolor='black',
                    marker=markers[idx],
                    label=cl)


>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2)
>>> lr = LogisticRegression()
>>> X_train_pca = pca.fit_transform(X_train_std)
>>> X_test_pca = pca.transform(X_test_std)
>>> lr.fit(X_train_pca, y_train)
>>> plot_decision_regions(X_train_pca, y_train, classifier=lr)
>>> plt.xlabel('PC 1')
>>> plt.ylabel('PC 2')
>>> plt.legend(loc='lower left')
>>> plt.show()


By executing the preceding code, we should now see the decision regions for the 
training data reduced to two principal component axes: 








When we compare PCA projections via scikit-learn with our own PCA 
implementation, the resulting plots may be mirror images of each other. This is not 
due to an error in either of the two implementations; depending on the eigensolver, 
eigenvectors can have either negative or positive signs, and an eigenvector multiplied 
by -1 is still a valid eigenvector. The sign does not matter for the analysis, but we 
could simply revert the mirror image by multiplying the data by -1 if we wanted to; 
note that eigenvectors are typically scaled to unit length. For the sake of completeness, 
let's plot the decision regions of the logistic regression on the transformed test dataset 
to see if it can separate the classes well: 


>>> plot_decision_regions(X_test_pca, y_test, classifier=lr)
>>> plt.xlabel('PC 1')
>>> plt.ylabel('PC 2')
>>> plt.legend(loc='lower left')
>>> plt.show()


After we plotted the decision regions for the test set by executing the preceding code, 
we can see that logistic regression performs quite well on this small two-dimensional 
feature subspace and only misclassifies very few samples in the test dataset: 
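
To put a number on "quite well", the fit-on-training, transform-both workflow can be sketched end to end with accuracy scores instead of a plot. This is a minimal sketch under assumptions: it uses scikit-learn's bundled copy of the Wine data (load_wine) and its own 70/30 stratified split rather than the DataFrame prepared earlier in the chapter, so the exact scores may differ from what the figure shows.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Wine data stands in for the chapter's dataset.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Fit the scaler and PCA on the training data only, then reuse them
# unchanged on the test data.
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

lr = LogisticRegression()
lr.fit(X_train_pca, y_train)

train_acc = lr.score(X_train_pca, y_train)
test_acc = lr.score(X_test_pca, y_test)
print('Training accuracy: %.3f' % train_acc)
print('Test accuracy: %.3f' % test_acc)
```

Calling pca.transform (not fit_transform) on the test data is the important design choice here: the test samples are projected onto the axes learned from the training data.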








If we are interested in the explained variance ratios of the different principal 
components, we can simply initialize the PCA class with the n_components parameter 
set to None, so all principal components are kept and the explained variance ratio can 
then be accessed via the explained_variance_ratio_ attribute: 


>>> pca = PCA(n_components=None)
>>> X_train_pca = pca.fit_transform(X_train_std)
>>> pca.explained_variance_ratio_
array([ 0.36951469,  0.18434927,  0.11815159,  0.07334252,  0.06422108,
        0.05051724,  0.03954654,  0.02643918,  0.02389319,  0.01629614,
        0.01380021,  0.01172226,  0.00820609])

Note that we set n_components=None when we initialized the PCA class so that it will 
return all principal components in a sorted order instead of performing a 
dimensionality reduction. 
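
The explained variance ratio itself is easy to reproduce by hand: each ratio is one eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The following NumPy sketch illustrates this on hypothetical random data (not the Wine dataset); the variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical data: 200 samples, 4 features with very different spreads.
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# The eigenvalues of the covariance matrix are the variances along
# the principal component axes.
cov_mat = np.cov(X.T)
eigen_vals = np.linalg.eigvalsh(cov_mat)[::-1]  # sorted in descending order

# Each ratio is an eigenvalue divided by the sum of all eigenvalues.
var_ratio = eigen_vals / eigen_vals.sum()
cum_var_ratio = np.cumsum(var_ratio)
print(var_ratio)
print(cum_var_ratio)
```

Because all components are kept, the ratios sum to one; the cumulative sum shows how many components are needed to retain a given share of the variance.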

Supervised data compression via linear 
discriminant analysis 


Linear Discriminant Analysis (LDA) can be used as a technique for feature 
extraction to increase the computational efficiency and reduce the degree of 
overfitting due to the curse of dimensionality in non-regularized models. 


The general concept behind LDA is very similar to PCA. Whereas PCA attempts to 
find the orthogonal component axes of maximum variance in a dataset, the goal in 
LDA is to find the feature subspace that optimizes class separability. In the following 
sections, we will discuss the similarities between LDA and PCA in more detail and 
walk through the LDA approach step by step. 

Principal component analysis versus linear 
discriminant analysis 


Both LDA and PCA are linear transformation techniques that can be used to reduce 
the number of dimensions in a dataset; PCA is an unsupervised algorithm, 
whereas LDA is supervised. Thus, we might intuitively think that LDA is a 
superior feature extraction technique for classification tasks compared to PCA. 
However, A. M. Martinez reported that preprocessing via PCA tends to result in 
better classification results in an image recognition task in certain cases, for instance 
if each class consists of only a small number of samples (PCA Versus LDA, A. M. 
Martinez and A. C. Kak, IEEE Transactions on Pattern Analysis and Machine 
Intelligence, 23(2): 228-233, 2001). 


Note 


LDA is sometimes also called Fisher's LDA. Ronald A. Fisher initially formulated 
Fisher's Linear Discriminant for two-class classification problems in 1936 (The Use 
of Multiple Measurements in Taxonomic Problems, R. A. Fisher, Annals of Eugenics, 
7(2): 179-188, 1936). Fisher's linear discriminant was later generalized for 
multi-class problems by C. Radhakrishna Rao under the assumption of equal class 
covariances and normally distributed classes in 1948, which we now call LDA (The 
Utilization of Multiple Measurements in Problems of Biological Classification, C. R. 
Rao, Journal of the Royal Statistical Society. Series B (Methodological), 10(2): 
159-203, 1948). 


The following figure summarizes the concept of LDA for a two-class problem. 
Samples from class 1 are shown as circles, and samples from class 2 are shown as 
crosses: 

A linear discriminant, as shown on the x-axis (LD 1), would separate the two normally 
distributed classes well. Although the exemplary linear discriminant shown on the 
y-axis (LD 2) captures a lot of the variance in the dataset, it would fail as a good linear 
discriminant since it does not capture any of the class-discriminatory information. 
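
The difference between "a lot of variance" and "class-discriminatory information" can be made concrete with a small numeric sketch. The data below is hypothetical: two Gaussian classes that differ only along the first axis, while the second axis carries most of the variance. The fisher_ratio helper is an illustrative name, not a library function; it divides the squared distance between the class means by the summed class variances.

```python
import numpy as np

rng = np.random.RandomState(1)
# Hypothetical two-class data: the classes differ along feature 0 only,
# while feature 1 carries most of the variance.
n = 200
X1 = np.column_stack((rng.normal(-2.0, 0.5, n), rng.normal(0.0, 5.0, n)))
X2 = np.column_stack((rng.normal(2.0, 0.5, n), rng.normal(0.0, 5.0, n)))
X = np.vstack((X1, X2))

def fisher_ratio(a, b):
    """Squared distance of the class means over the summed class variances."""
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var())

variances = X.var(axis=0)
ratios = [fisher_ratio(X1[:, j], X2[:, j]) for j in range(2)]
print('Variance per axis:', variances)
print('Fisher ratio per axis:', ratios)
```

A variance-based criterion would favor the second axis, whereas the class-separation criterion clearly favors the first; this is the situation sketched for LD 1 and LD 2 above.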


One assumption in LDA is that the data is normally distributed. Also, we assume that 
the classes have identical covariance matrices and that the features are statistically 
independent of each other. However, even if one or more of those assumptions are 
(slightly) violated, LDA for dimensionality reduction can still work reasonably well 
(Pattern Classification 2nd Edition, R. O. Duda, P. E. Hart, and D. G. Stork, New 
York, 2001). 




The inner workings of linear discriminant 
analysis 


Before we dive into the code implementation, let's briefly summarize the main steps 
that are required to perform LDA: 


1. Standardize the d-dimensional dataset (d is the number of features). 
2. For each class, compute the d-dimensional mean vector. 
3. Construct the between-class scatter matrix $S_B$ and the within-class scatter 
   matrix $S_W$. 
4. Compute the eigenvectors and corresponding eigenvalues of the matrix 
   $S_W^{-1} S_B$. 
5. Sort the eigenvalues by decreasing order to rank the corresponding 
   eigenvectors. 
6. Choose the k eigenvectors that correspond to the k largest eigenvalues to 
   construct a $d \times k$-dimensional transformation matrix W; the eigenvectors are 
   the columns of this matrix. 
7. Project the samples onto the new feature subspace using the transformation 
   matrix W. 


As we can see, LDA is quite similar to PCA in the sense that we are decomposing 
matrices into eigenvalues and eigenvectors, which will form the new 
lower-dimensional feature space. However, as mentioned before, LDA takes class label 
information into account, which is represented in the form of the mean vectors 
computed in step 2. In the following sections, we will discuss these seven steps in 
more detail, accompanied by illustrative code implementations. 
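
Before going through the steps one at a time, it may help to see all seven of them in a single compact function. The sketch below is an illustration under assumptions, not the chapter's implementation: lda_transform is a hypothetical helper name, the data is synthetic, and the within-class scatter is accumulated from per-class covariance matrices (the scaled variant discussed later in this section).

```python
import numpy as np

def lda_transform(X, y, k):
    """Project X onto its k most discriminative linear discriminants."""
    # Step 1: standardize the features.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    d = X_std.shape[1]
    overall_mean = X_std.mean(axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for label in np.unique(y):
        X_c = X_std[y == label]
        # Step 2: class mean vector.
        mean_c = X_c.mean(axis=0)
        # Step 3: accumulate within- and between-class scatter.
        S_W += np.cov(X_c.T)
        diff = (mean_c - overall_mean).reshape(d, 1)
        S_B += X_c.shape[0] * diff.dot(diff.T)
    # Step 4: eigendecomposition of inv(S_W) S_B.
    eigen_vals, eigen_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))
    # Steps 5 and 6: rank by eigenvalue magnitude, keep the top k eigenvectors.
    order = np.argsort(np.abs(eigen_vals))[::-1][:k]
    W = eigen_vecs[:, order].real
    # Step 7: project the samples onto the new subspace.
    return X_std.dot(W), W

rng = np.random.RandomState(0)
# Hypothetical data: three Gaussian classes in four dimensions.
means = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in means])
y = np.repeat([0, 1, 2], 50)

X_lda, W = lda_transform(X, y, k=2)
print(X_lda.shape, W.shape)
```

With three classes, k=2 is the largest useful choice, for reasons discussed later in this section.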

Computing the scatter matrices 


Since we already standardized the features of the Wine dataset in the PCA section at 
the beginning of this chapter, we can skip the first step and proceed with the 
calculation of the mean vectors, which we will use to construct the within-class 
scatter matrix and between-class scatter matrix, respectively. Each mean vector $m_i$ 
stores the mean feature value $\mu_m$ with respect to the samples of class $i$: 

$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x$$

This results in three mean vectors: 

$$m_i = \begin{bmatrix} \mu_{i,\text{alcohol}} \\ \mu_{i,\text{malic acid}} \\ \vdots \\ \mu_{i,\text{proline}} \end{bmatrix}, \quad i \in \{1, 2, 3\}$$

>>> np.set_printoptions(precision=4)
>>> mean_vecs = []
>>> for label in range(1, 4):
...     mean_vecs.append(np.mean(
...         X_train_std[y_train==label], axis=0))
...     print('MV %s: %s\n' % (label, mean_vecs[label - 1]))
MV 1: [ 0.9066 -0.3497  0.3201 -0.7189  0.5056  0.8807  0.9589 -0.5516
  0.5416  0.2338  0.5897  0.6563  1.2075]

MV 2: [-0.8749 -0.2848 -0.3735  0.3157 -0.3848 -0.0433  0.0635 -0.0946
  0.0703 -0.8286  0.3144  0.3608 -0.7253]

MV 3: [ 0.1992  0.866   0.1682  0.4148 -0.0451 -1.0286 -1.2876  0.8287
 -0.7795  0.9649 -1.209  -1.3622 -0.4013]

Using the mean vectors, we can now compute the within-class scatter matrix $S_W$: 

$$S_W = \sum_{i=1}^{c} S_i$$

This is calculated by summing up the individual scatter matrices $S_i$ of each 
individual class $i$: 

$$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T$$


>>> d = 13 # number of features
>>> S_W = np.zeros((d, d))
>>> for label, mv in zip(range(1, 4), mean_vecs):
...     class_scatter = np.zeros((d, d))
...     for row in X_train_std[y_train == label]:
...         row, mv = row.reshape(d, 1), mv.reshape(d, 1)
...         class_scatter += (row - mv).dot((row - mv).T)
...     S_W += class_scatter
>>> print('Within-class scatter matrix: %sx%s' % (
...       S_W.shape[0], S_W.shape[1]))
Within-class scatter matrix: 13x13


The assumption that we are making when we are computing the scatter matrices is 
that the class labels in the training set are uniformly distributed. However, if we print 
the number of class labels, we see that this assumption is violated: 


>>> print('Class label distribution: %s'
...       % np.bincount(y_train)[1:])
Class label distribution: [41 50 33]


Thus, we want to scale the individual scatter matrices $S_i$ before we sum them up as 
scatter matrix $S_W$. When we divide the scatter matrices by the number of 
class-samples $n_i$, we can see that computing the scatter matrix is in fact the same as 
computing the covariance matrix $\Sigma_i$; the covariance matrix is a normalized version 
of the scatter matrix: 

$$\Sigma_i = \frac{1}{n_i} S_i = \frac{1}{n_i} \sum_{x \in D_i} (x - m_i)(x - m_i)^T$$


>>> d = 13 # number of features
>>> S_W = np.zeros((d, d))
>>> for label, mv in zip(range(1, 4), mean_vecs):
...     class_scatter = np.cov(X_train_std[y_train==label].T)
...     S_W += class_scatter
>>> print('Scaled within-class scatter matrix: %sx%s'
...       % (S_W.shape[0], S_W.shape[1]))
Scaled within-class scatter matrix: 13x13
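
The relationship between a class scatter matrix and the covariance matrix can be checked numerically. The following sketch uses hypothetical random data for a single class. One detail worth noting: np.cov normalizes by the number of samples minus one by default, and only divides by the number of samples itself when bias=True is passed.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical class sample: 40 samples, 3 features.
X_c = rng.normal(size=(40, 3))
n_c, d = X_c.shape
mean_c = X_c.mean(axis=0)

# Scatter matrix: sum of outer products of the mean-centered samples.
S_c = np.zeros((d, d))
for row in X_c:
    diff = (row - mean_c).reshape(d, 1)
    S_c += diff.dot(diff.T)

# np.cov normalizes by n - 1 by default; bias=True normalizes by n.
cov_unbiased = np.cov(X_c.T)
cov_biased = np.cov(X_c.T, bias=True)
print(np.allclose(S_c / (n_c - 1), cov_unbiased))
print(np.allclose(S_c / n_c, cov_biased))
```

For the purpose of LDA, the choice between the two normalizations only rescales each class contribution slightly and matters little unless a class is very small.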


After we computed the scaled within-class scatter matrix (or covariance matrix), we 
can move on to the next step and compute the between-class scatter matrix $S_B$: 

$$S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$$

Here, $m$ is the overall mean that is computed, including samples from all classes: 


>>> mean_overall = np.mean(X_train_std, axis=0)
>>> d = 13 # number of features
>>> S_B = np.zeros((d, d))
>>> for i, mean_vec in enumerate(mean_vecs):
...     n = X_train[y_train == i + 1, :].shape[0]
...     mean_vec = mean_vec.reshape(d, 1) # make column vector
...     mean_overall = mean_overall.reshape(d, 1)
...     S_B += n * (mean_vec - mean_overall).dot(
...         (mean_vec - mean_overall).T)
>>> print('Between-class scatter matrix: %sx%s' % (
...       S_B.shape[0], S_B.shape[1]))

Selecting linear discriminants for the new 
feature subspace 


The remaining steps of the LDA are similar to the steps of the PCA. However, 
instead of performing the eigendecomposition on the covariance matrix, we solve the 
generalized eigenvalue problem of the matrix $S_W^{-1} S_B$: 


>>> eigen_vals, eigen_vecs =\
...     np.linalg.eig(np.linalg.inv(S_W).dot(S_B))


After we computed the eigenpairs, we can now sort the eigenvalues in descending 
order: 


>>> eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:,i])
...                for i in range(len(eigen_vals))]
>>> eigen_pairs = sorted(eigen_pairs,
...                      key=lambda k: k[0], reverse=True)
>>> print('Eigenvalues in descending order:\n')
>>> for eigen_val in eigen_pairs:
...     print(eigen_val[0])

Eigenvalues in descending order:

349.617808906
172.76152219
3.78531345125e-14
2.11739844822e-14
1.51646188942e-14
1.51646188942e-14
1.35795671405e-14
1.35795671405e-14
7.58776037165e-15
5.90603998447e-15
5.90603998447e-15
2.25644197857e-15
0.0


In LDA, the number of linear discriminants is at most c-1, where c is the number of 
class labels, since the in-between scatter matrix $S_B$ is the sum of c matrices with 
rank 1 or less, and the c weighted mean deviations $n_i (m_i - m)$ sum to zero, so at 
most c-1 of them are linearly independent. We can indeed see that we only have two 
nonzero eigenvalues (the eigenvalues 3-13 are not exactly zero, but this is due to the 
floating point arithmetic in NumPy). 

Note 


Note that in the rare case of perfect collinearity (all aligned sample points fall on a 
straight line), the covariance matrix would have rank one, which would result in only 
one eigenvector with a nonzero eigenvalue. 
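
The rank argument can be verified numerically. The following sketch builds a between-class scatter matrix from hypothetical class means and class sizes (three classes, six features) and asks NumPy for its rank; all names and numbers are chosen for illustration only.

```python
import numpy as np

rng = np.random.RandomState(0)
c, d = 3, 6  # hypothetical: three classes, six features
n_per_class = np.array([30, 40, 50])
class_means = rng.normal(size=(c, d))
# The overall mean is the sample-weighted average of the class means.
overall_mean = n_per_class.dot(class_means) / n_per_class.sum()

S_B = np.zeros((d, d))
for n_i, m_i in zip(n_per_class, class_means):
    diff = (m_i - overall_mean).reshape(d, 1)
    S_B += n_i * diff.dot(diff.T)  # each term has rank 1 at most

rank = np.linalg.matrix_rank(S_B)
print('Rank of S_B:', rank)
```

Although three rank-1 matrices are added, the result has rank two, because the weighted deviations from the overall mean are linearly dependent.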


To measure how much of the class-discriminatory information is captured by the 
linear discriminants (eigenvectors), let's plot the linear discriminants by decreasing 
eigenvalues similar to the explained variance plot that we created in the PCA section. 
For simplicity, we will call the content of class-discriminatory information 
discriminability: 


>>> tot = sum(eigen_vals.real)
>>> discr = [(i / tot) for i in sorted(eigen_vals.real, reverse=True)]
>>> cum_discr = np.cumsum(discr)
>>> plt.bar(range(1, 14), discr, alpha=0.5, align='center',
...         label='individual "discriminability"')
>>> plt.step(range(1, 14), cum_discr, where='mid',
...          label='cumulative "discriminability"')
>>> plt.ylabel('"discriminability" ratio')
>>> plt.xlabel('Linear Discriminants')
>>> plt.ylim([-0.1, 1.1])
>>> plt.legend(loc='best')
>>> plt.show()


As we can see in the resulting figure, the first two linear discriminants alone capture 
100 percent of the useful information in the Wine training dataset: 

[Figure: individual and cumulative "discriminability" ratio per linear discriminant]





Let's now stack the two most discriminative eigenvector columns to create the 
transformation matrix W: 


>>> w = np.hstack((eigen_pairs[0][1][:, np.newaxis].real,
...                eigen_pairs[1][1][:, np.newaxis].real))
>>> print('Matrix W:\n', w)
Matrix W:
 [[-0.1481 -0.4092]
 [ 0.0908 -0.1577]
 [-0.0168 -0.3537]
 [ 0.1484  0.3223]
 [-0.0163 -0.0817]
 [ 0.1913  0.0842]
 [-0.7338  0.2823]
 [-0.075  -0.0102]
 [ 0.0018  0.0907]
 [ 0.294  -0.2152]
 [-0.0328  0.2747]
 [-0.3547 -0.0124]
 [-0.3915 -0.5958]]

Projecting samples onto the new feature space 


Using the transformation matrix W that we created in the previous subsection, we 
can now transform the training dataset by multiplying the matrices: 

$$X' = XW$$

>>> X_train_lda = X_train_std.dot(w)
>>> colors = ['r', 'b', 'g']
>>> markers = ['s', 'x', 'o']
>>> for l, c, m in zip(np.unique(y_train), colors, markers):
...     plt.scatter(X_train_lda[y_train==l, 0],
...                 X_train_lda[y_train==l, 1] * (-1),
...                 c=c, label=l, marker=m)
>>> plt.xlabel('LD 1')
>>> plt.ylabel('LD 2')
>>> plt.legend(loc='lower right')
>>> plt.show()


As we can see in the resulting plot, the three wine classes are now perfectly linearly 
separable in the new feature subspace: 

LDA via scikit-learn 


The step-by-step implementation was a good exercise to understand the inner 
workings of an LDA and understand the differences between LDA and PCA. Now, 
let's look at the LDA class implemented in scikit-learn: 


>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
>>> lda = LDA(n_components=2)
>>> X_train_lda = lda.fit_transform(X_train_std, y_train)


Next, let's see how the logistic regression classifier handles the lower-dimensional 
training dataset after the LDA transformation: 


>>> lr = LogisticRegression()
>>> lr = lr.fit(X_train_lda, y_train)
>>> plot_decision_regions(X_train_lda, y_train, classifier=lr)
>>> plt.xlabel('LD 1')
>>> plt.ylabel('LD 2')
>>> plt.legend(loc='lower left')
>>> plt.show()


Looking at the resulting plot, we see that the logistic regression model misclassifies 
one of the samples from class 2: 

By lowering the regularization strength, we could probably shift the decision 
boundaries so that the logistic regression model classifies all samples in the training 
dataset correctly. However, and more importantly, let us take a look at the results on 
the test set: 


>>> X_test_lda = lda.transform(X_test_std)
>>> plot_decision_regions(X_test_lda, y_test, classifier=lr)
>>> plt.xlabel('LD 1')
>>> plt.ylabel('LD 2')
>>> plt.legend(loc='lower left')
>>> plt.show()

As we can see in the following plot, the logistic regression classifier is able to get a 
perfect accuracy score for classifying the samples in the test dataset by only using a 
two-dimensional feature subspace instead of the original 13 Wine features: 
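
The same experiment can be summarized with an accuracy score rather than a plot. The sketch below again assumes scikit-learn's bundled Wine data (load_wine) and its own stratified 70/30 split as a stand-in for the chapter's data preparation, so the exact score is not guaranteed to match the figure.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Wine data stands in for the chapter's dataset.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# LDA is supervised, so fit_transform also receives the class labels.
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train_std, y_train)
X_test_lda = lda.transform(X_test_std)

lr = LogisticRegression().fit(X_train_lda, y_train)
test_acc = lr.score(X_test_lda, y_test)
print('Test accuracy: %.3f' % test_acc)
```

Note the one API difference from PCA: fit_transform takes y_train as a second argument, whereas transform on the test data needs no labels.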

Using kernel principal component 
analysis for nonlinear mappings 


Many machine learning algorithms make assumptions about the linear separability of 
the input data. You learned that the perceptron even requires perfectly linearly 
separable training data to converge. Other algorithms that we have covered so far 
assume that the lack of perfect linear separability is due to noise: Adaline, logistic 
regression, and the (standard) SVM to just name a few. 


However, if we are dealing with nonlinear problems, which we may encounter rather 
frequently in real-world applications, linear transformation techniques for 
dimensionality reduction, such as PCA and LDA, may not be the best choice. In this 
section, we will take a look at a kernelized version of PCA, or KPCA, which relates 
to the concepts of kernel SVM that we remember from Chapter 3, A Tour of Machine 
Learning Classifiers Using scikit-learn. Using kernel PCA, we will learn how to 
transform data that is not linearly separable onto a new, lower-dimensional subspace 
that is suitable for linear classifiers. 

Kernel functions and the kernel trick 


As we remember from our discussion about kernel SVMs in Chapter 3, A Tour of 
Machine Learning Classifiers Using scikit-learn, we can tackle nonlinear problems 
by projecting them onto a new feature space of higher dimensionality where the 
classes become linearly separable. To transform the samples $x \in \mathbb{R}^d$ onto this higher 
k-dimensional subspace, we defined a nonlinear mapping function $\phi$: 

$$\phi : \mathbb{R}^d \to \mathbb{R}^k \quad (k \gg d)$$

We can think of $\phi$ as a function that creates nonlinear combinations of the original 
features to map the original d-dimensional dataset onto a larger, k-dimensional 
feature space. For example, if we had a feature vector $x \in \mathbb{R}^d$ (x is a column vector 
consisting of d features) with two dimensions (d = 2), a potential mapping onto a 
3D-space could be: 

$$x = [x_1, x_2]^T$$

$$\downarrow \phi$$

$$z = \left[x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\right]^T$$

In other words, we perform a nonlinear mapping via kernel PCA that transforms the 
data onto a higher-dimensional space. We then use standard PCA in this 
higher-dimensional space to project the data back onto a lower-dimensional space where the 
samples can be separated by a linear classifier (under the condition that the samples 
can be separated by density in the input space). However, one downside of this 
approach is that it is computationally very expensive, and this is where we use the 
kernel trick. Using the kernel trick, we can compute the similarity between two 
high-dimensional feature vectors in the original feature space. 
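
A small numeric check makes the kernel trick tangible. The sketch assumes the common 2D-to-3D mapping $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$; for this mapping, the dot product in the 3D space equals the squared dot product in the original 2D space, so the mapped vectors never have to be computed. The function names phi and poly_kernel are illustrative only.

```python
import numpy as np

def phi(x):
    """Explicit map from 2D to 3D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel evaluated in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # dot product in the 3D feature space
implicit = poly_kernel(x, z)       # same value without computing phi
print(explicit, implicit)
```

For two dimensions the saving is negligible, but for mappings into very high-dimensional (or infinite-dimensional) spaces, evaluating the kernel directly is the only practical option.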


Before we proceed with more details about the kernel trick to tackle this 
computationally expensive problem, let us think back to the standard PCA approach 
that we implemented at the beginning of this chapter. We computed the covariance 
between two features k and j as follows: 

$$\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} \left(x_j^{(i)} - \mu_j\right)\left(x_k^{(i)} - \mu_k\right)$$

Since the standardizing of features centers them at mean zero, for instance, $\mu_j = 0$ 
and $\mu_k = 0$, we can simplify this equation as follows: 

$$\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)} x_k^{(i)}$$

Note that the preceding equation refers to the covariance between two features; now, 
let us write the general equation to calculate the covariance matrix $\Sigma$: 

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^T$$

Bernhard Schölkopf generalized this approach (Kernel principal component 
analysis, B. Schölkopf, A. Smola, and K.-R. Müller, pages 583-588, 1997) so that we 
can replace the dot products between samples in the original feature space with the 
nonlinear feature combinations via $\phi$: 

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T$$

To obtain the eigenvectors (the principal components) from this covariance matrix, 
we have to solve the following equation: 

$$\Sigma v = \lambda v$$

$$\Rightarrow \frac{1}{n} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T v = \lambda v$$

$$\Rightarrow v = \frac{1}{n \lambda} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T v = \frac{1}{n} \sum_{i=1}^{n} a^{(i)} \phi(x^{(i)})$$

Here, $\lambda$ and $v$ are the eigenvalues and eigenvectors of the covariance matrix $\Sigma$, and 
$a$ can be obtained by extracting the eigenvectors of the kernel (similarity) matrix K, 
as we will see in the next paragraphs. 


Note 


The derivation of the kernel matrix can be shown as follows. First, let's write the 
covariance matrix in matrix notation, where $\phi(X)$ is an $n \times k$-dimensional 
matrix: 

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T = \frac{1}{n} \phi(X)^T \phi(X)$$

Now, we can write the eigenvector equation as follows: 

$$v = \frac{1}{n} \sum_{i=1}^{n} a^{(i)} \phi(x^{(i)}) = \frac{1}{n} \phi(X)^T a$$

Since $\Sigma v = \lambda v$, we get (after canceling the common factor $\frac{1}{n}$ on both sides): 

$$\frac{1}{n} \phi(X)^T \phi(X) \phi(X)^T a = \lambda \phi(X)^T a$$

Multiplying it by $\phi(X)$ on both sides yields the following result: 

$$\frac{1}{n} \phi(X) \phi(X)^T \phi(X) \phi(X)^T a = \lambda \phi(X) \phi(X)^T a$$

$$\Rightarrow \frac{1}{n} \phi(X) \phi(X)^T a = \lambda a$$

$$\Rightarrow \frac{1}{n} K a = \lambda a$$

Here, K is the similarity (kernel) matrix: 

$$K = \phi(X) \phi(X)^T$$

As we recall from the Solving nonlinear problems using a kernel SVM section in 
Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, we use the 
kernel trick to avoid calculating the pairwise dot products of the samples x under $\phi$ 
explicitly by using a kernel function $\kappa$ so that we don't need to calculate the 
eigenvectors explicitly: 

$$\kappa(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T \phi(x^{(j)})$$

In other words, what we obtain after kernel PCA are the samples already projected 
onto the respective components, rather than constructing a transformation matrix as 
in the standard PCA approach. Basically, the kernel function (or simply kernel) can 
be understood as a function that calculates a dot product between two vectors, that is, 
a measure of similarity. 


The most commonly used kernels are as follows: 


• The polynomial kernel: 

$$\kappa(x^{(i)}, x^{(j)}) = \left({x^{(i)}}^T x^{(j)} + \theta\right)^p$$

  Here, $\theta$ is the threshold and $p$ is the power that has to be specified by the 
  user. 

• The hyperbolic tangent (sigmoid) kernel: 

$$\kappa(x^{(i)}, x^{(j)}) = \tanh\left(\eta\, {x^{(i)}}^T x^{(j)} + \theta\right)$$

• The Radial Basis Function (RBF) or Gaussian kernel, which we will use in the 
  examples in the next subsection: 

$$\kappa(x^{(i)}, x^{(j)}) = \exp\left(-\frac{\lVert x^{(i)} - x^{(j)} \rVert^2}{2\sigma^2}\right)$$

  It is often written in the following form, introducing the variable $\gamma = \frac{1}{2\sigma^2}$: 

$$\kappa(x^{(i)}, x^{(j)}) = \exp\left(-\gamma \lVert x^{(i)} - x^{(j)} \rVert^2\right)$$

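
The three kernels translate almost directly into NumPy. The following sketch is for illustration only; the function names and the default parameter values (theta, p, eta, gamma) are assumptions, not values prescribed by the text.

```python
import numpy as np

def polynomial_kernel(x, z, theta=1.0, p=2):
    """(x^T z + theta)^p"""
    return (np.dot(x, z) + theta) ** p

def sigmoid_kernel(x, z, eta=0.1, theta=0.0):
    """tanh(eta * x^T z + theta)"""
    return np.tanh(eta * np.dot(x, z) + theta)

def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])

print(polynomial_kernel(x, z))
print(sigmoid_kernel(x, z))
print(rbf_kernel(x, z))

# The sigma and gamma parameterizations of the RBF kernel agree
# when gamma = 1 / (2 * sigma^2).
sigma = 0.5
rbf_sigma_form = np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
rbf_gamma_form = rbf_kernel(x, z, gamma=1.0 / (2.0 * sigma ** 2))
```

A larger gamma (smaller sigma) makes the RBF similarity decay faster with distance, so each sample is only "similar" to its close neighbors.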


To summarize what we have learned so far, we can define the following three steps 
to implement an RBF kernel PCA: 

1. We compute the kernel (similarity) matrix K, where we need to calculate the 
   following: 

$$\kappa(x^{(i)}, x^{(j)}) = \exp\left(-\gamma \lVert x^{(i)} - x^{(j)} \rVert^2\right)$$

   We do this for each pair of samples: 

$$K = \begin{bmatrix}
\kappa(x^{(1)}, x^{(1)}) & \kappa(x^{(1)}, x^{(2)}) & \cdots & \kappa(x^{(1)}, x^{(n)}) \\
\kappa(x^{(2)}, x^{(1)}) & \kappa(x^{(2)}, x^{(2)}) & \cdots & \kappa(x^{(2)}, x^{(n)}) \\
\vdots & \vdots & \ddots & \vdots \\
\kappa(x^{(n)}, x^{(1)}) & \kappa(x^{(n)}, x^{(2)}) & \cdots & \kappa(x^{(n)}, x^{(n)})
\end{bmatrix}$$

   For example, if our dataset contains 100 training samples, the symmetric kernel 
   matrix of the pairwise similarities would be 100 x 100-dimensional. 

2. We center the kernel matrix K using the following equation: 

$$K' = K - 1_n K - K 1_n + 1_n K 1_n$$

   Here, $1_n$ is an $n \times n$-dimensional matrix (the same dimensions as the kernel 
   matrix) where all values are equal to $\frac{1}{n}$. 

3. We collect the top k eigenvectors of the centered kernel matrix based on their 
   corresponding eigenvalues, which are ranked by decreasing magnitude. In 
   contrast to standard PCA, the eigenvectors are not the principal component 
   axes, but the samples already projected onto these axes. 


At this point, you may be wondering why we need to center the kernel matrix in the 
second step. We previously assumed that we are working with standardized data, 
where all features have mean zero when we formulated the covariance matrix and 
replaced the dot products with the nonlinear feature combinations via $\phi$. Thus, the 
centering of the kernel matrix in the second step becomes necessary, since we do not 
compute the new feature space explicitly so that we cannot guarantee that the new 
feature space is also centered at zero. 
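
The centering formula can be sanity-checked in a case where the feature space is known explicitly. With a linear kernel, the mapping is the identity, so we can center the features directly and compare. The sketch below does this on hypothetical, deliberately uncentered data; it is an illustration of the formula, not part of the kernel PCA implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical uncentered data; with a linear kernel, phi(x) = x,
# so the feature space can be centered explicitly for comparison.
X = rng.normal(loc=5.0, size=(6, 3))
n = X.shape[0]

K = X.dot(X.T)                       # uncentered (linear) kernel matrix
one_n = np.ones((n, n)) / n
K_centered = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

X_c = X - X.mean(axis=0)             # explicit centering of the features
K_explicit = X_c.dot(X_c.T)
print(np.allclose(K_centered, K_explicit))
```

For the RBF kernel the mapped features are never available, which is exactly why the centering has to be carried out on the kernel matrix itself.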


In the next section, we will put those three steps into action by implementing a 
kernel PCA in Python. 

Implementing a kernel principal component 
analysis in Python 


In the previous subsection, we discussed the core concepts behind kernel PCA. Now, 
we are going to implement an RBF kernel PCA in Python following the three steps 
that summarized the kernel PCA approach. Using some SciPy and NumPy helper 
functions, we will see that implementing a kernel PCA is actually really simple: 


from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
import numpy as np


def rbf_kernel_pca(X, gamma, n_components):
    """
    RBF kernel PCA implementation.

    Parameters
    ------------
    X: {NumPy ndarray}, shape = [n_samples, n_features]

    gamma: float
      Tuning parameter of the RBF kernel

    n_components: int
      Number of principal components to return

    Returns
    ------------
    X_pc: {NumPy ndarray}, shape = [n_samples, k_features]
      Projected dataset

    """
    # Calculate pairwise squared Euclidean distances
    # in the MxN dimensional dataset.
    sq_dists = pdist(X, 'sqeuclidean')

    # Convert pairwise distances into a square matrix.
    mat_sq_dists = squareform(sq_dists)

    # Compute the symmetric kernel matrix.
    # (np.exp is used here because scipy.exp is not available
    # in recent SciPy releases.)
    K = np.exp(-gamma * mat_sq_dists)

    # Center the kernel matrix.
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

    # Obtaining eigenpairs from the centered kernel matrix
    # scipy.linalg.eigh returns them in ascending order
    eigvals, eigvecs = eigh(K)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Collect the top k eigenvectors (projected samples)
    # (a list is passed because np.column_stack expects a sequence)
    X_pc = np.column_stack([eigvecs[:, i]
                            for i in range(n_components)])

    return X_pc


One downside of using an RBF kernel PCA for dimensionality reduction is that we 
have to specify the $\gamma$ parameter a priori. Finding an appropriate value for $\gamma$ 
requires experimentation and is best done using algorithms for parameter tuning, for 
example, performing a grid search, which we will discuss in more detail in Chapter 
6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning. 


Example 1 — separating half-moon shapes 


Now, let us apply our rbf_kernel_pca on some nonlinear example datasets. We will 
start by creating a two-dimensional dataset of 100 sample points representing two 
half-moon shapes: 


>>> from sklearn.datasets import make_moons
>>> X, y = make_moons(n_samples=100, random_state=123)
>>> plt.scatter(X[y==0, 0], X[y==0, 1],
...             color='red', marker='^', alpha=0.5)
>>> plt.scatter(X[y==1, 0], X[y==1, 1],
...             color='blue', marker='o', alpha=0.5)
>>> plt.show()


For the purposes of illustration, the half-moon of triangle symbols shall represent 
one class, and the half-moon depicted by the circle symbols represents the samples 
from another class: 

[Figure: scatter plot of the two half-moon shapes]

Clearly, these two half-moon shapes are not linearly separable, and our goal is to 
unfold the half-moons via kernel PCA so that the dataset can serve as a suitable input 
for a linear classifier. But first, let's see how the dataset looks if we project it onto the 
principal components via standard PCA: 


>>> from sklearn.decomposition import PCA
>>> scikit_pca = PCA(n_components=2)
>>> X_spca = scikit_pca.fit_transform(X)
>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7,3))
>>> ax[0].scatter(X_spca[y==0, 0], X_spca[y==0, 1],
...               color='red', marker='^', alpha=0.5)
>>> ax[0].scatter(X_spca[y==1, 0], X_spca[y==1, 1],
...               color='blue', marker='o', alpha=0.5)
>>> ax[1].scatter(X_spca[y==0, 0], np.zeros((50,1))+0.02,
...               color='red', marker='^', alpha=0.5)
>>> ax[1].scatter(X_spca[y==1, 0], np.zeros((50,1))-0.02,
...               color='blue', marker='o', alpha=0.5)
>>> ax[0].set_xlabel('PC1')
>>> ax[0].set_ylabel('PC2')
>>> ax[1].set_ylim([-1, 1])
>>> ax[1].set_yticks([])
>>> ax[1].set_xlabel('PC1')
>>> plt.show()


Clearly, we can see in the resulting figure that a linear classifier would be unable to 
perform well on the dataset transformed via standard PCA: 





Note that when we plotted the first principal component only (right subplot), we 
shifted the triangular samples slightly upwards and the circular samples slightly 
downwards to better visualize the class overlap. As the left subplot shows, the 
original half-moon shapes are only slightly sheared and flipped across the vertical 
center; this transformation would not help a linear classifier in discriminating 
between circles and triangles. Similarly, the circles and triangles corresponding to 
the two half-moon shapes are not linearly separable if we project the dataset onto a 
one-dimensional feature axis, as shown in the right subplot. 


Note 


Please remember that PCA is an unsupervised method that, in contrast to LDA, does 
not use class label information when it maximizes the variance. Here, the triangle 
and circle symbols were just added for visualization purposes to indicate the degree 
of separation. 


Now, let us try out our kernel PCA function rbf_kernel_pca, which we 
implemented in the previous subsection: 


>>> X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)
>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7,3))
>>> ax[0].scatter(X_kpca[y==0, 0], X_kpca[y==0, 1],
...               color='red', marker='^', alpha=0.5)
>>> ax[0].scatter(X_kpca[y==1, 0], X_kpca[y==1, 1],
...               color='blue', marker='o', alpha=0.5)
>>> ax[1].scatter(X_kpca[y==0, 0], np.zeros((50,1))+0.02,
...               color='red', marker='^', alpha=0.5)
>>> ax[1].scatter(X_kpca[y==1, 0], np.zeros((50,1))-0.02,
...               color='blue', marker='o', alpha=0.5)
>>> ax[0].set_xlabel('PC1')
>>> ax[0].set_ylabel('PC2')
>>> ax[1].set_ylim([-1, 1])
>>> ax[1].set_yticks([])
>>> ax[1].set_xlabel('PC1')
>>> plt.show()


We can now see that the two classes (circles and triangles) are linearly well 
separated so that it becomes a suitable training dataset for linear classifiers: 
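
The visual impression can be backed by a simple numeric check: if the classes are well separated along the first kernel principal component, a single threshold on that component should classify most samples correctly. The sketch below is self-contained and rests on assumptions: it rebuilds noise-free half-moons by hand instead of calling make_moons, uses a NumPy-only variant of the RBF kernel PCA (rbf_kpca is a hypothetical name), and places the threshold halfway between the two class means.

```python
import numpy as np

def rbf_kpca(X, gamma, n_components):
    """NumPy-only RBF kernel PCA: kernel matrix, centering, eigenvectors."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.dot(X.T)
    K = np.exp(-gamma * sq_dists)
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)
    eigvals, eigvecs = np.linalg.eigh(K)  # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :n_components]

# Noise-free half-moons built by hand (an assumption; the text uses
# sklearn.datasets.make_moons).
t = np.linspace(0.0, np.pi, 50)
outer = np.column_stack((np.cos(t), np.sin(t)))
inner = np.column_stack((1.0 - np.cos(t), 0.5 - np.sin(t)))
X = np.vstack((outer, inner))
y = np.repeat([0, 1], 50)

X_kpca = rbf_kpca(X, gamma=15, n_components=2)
pc1 = X_kpca[:, 0]

# Threshold PC1 halfway between the two class means.
threshold = 0.5 * (pc1[y == 0].mean() + pc1[y == 1].mean())
pred = (pc1 > threshold).astype(int)
# The sign of an eigenvector is arbitrary, so accept either orientation.
accuracy = max(np.mean(pred == y), np.mean(pred != y))
print('Accuracy of a single threshold on PC1: %.2f' % accuracy)
```

Repeating the same check with a different gamma value is a quick way to see how sensitive the separation is to this tuning parameter.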







Unfortunately, there is no universal value for the tuning parameter $\gamma$ that works
well for different datasets. Finding a $\gamma$ value that is appropriate for a given problem
requires experimentation. In Chapter 6, Learning Best Practices for Model
Evaluation and Hyperparameter Tuning, we will discuss techniques that can help us
to automate the task of optimizing such tuning parameters. Here, I will use values for
$\gamma$ that I found produce good results.
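To build some intuition for why $\gamma$ needs tuning, the following minimal sketch evaluates the RBF kernel $\exp(-\gamma \lVert x - x' \rVert^2)$ for one fixed pair of points under several $\gamma$ values. The two points and the grid of $\gamma$ values are illustrative assumptions, not values taken from the examples in this chapter.

```python
import numpy as np

# Two hypothetical points at squared Euclidean distance 0.5.
x_a = np.array([0.0, 0.0])
x_b = np.array([0.5, 0.5])
sq_dist = np.sum((x_a - x_b) ** 2)

# Illustrative grid of gamma values (an assumption, not from the text).
gammas = [0.01, 0.1, 1.0, 15.0, 100.0]
similarities = [np.exp(-g * sq_dist) for g in gammas]

for g, s in zip(gammas, similarities):
    print('gamma=%6.2f  kernel value=%.6g' % (g, s))
```

For a small $\gamma$, the kernel value stays close to 1 for almost any pair of points, so all samples look alike; for a large $\gamma$, it falls towards 0 unless two points nearly coincide. Useful values lie between these extremes and depend on the scale of the dataset, which is one way to see why no single value works everywhere.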


Example 2 — separating concentric circles 


In the previous subsection, we showed how to separate half-moon shapes via kernel 
PCA. Since we put so much effort into understanding the concepts of kernel PCA, let 
us take a look at another interesting example of a nonlinear problem, concentric 
circles: 


>>> from sklearn.datasets import make_circles
>>> X, y = make_circles(n_samples=1000,
...                     random_state=123, noise=0.1, factor=0.2)
>>> plt.scatter(X[y==0, 0], X[y==0, 1],
...             color='red', marker='^', alpha=0.5)
>>> plt.scatter(X[y==1, 0], X[y==1, 1],
...             color='blue', marker='o', alpha=0.5)
>>> plt.show()


Again, we assume a two-class problem where the triangle shapes represent one class,
and the circle shapes represent another class.


Let's start with the standard PCA approach to compare it to the results of the RBF
kernel PCA:

>>> scikit_pca = PCA(n_components=2)
>>> X_spca = scikit_pca.fit_transform(X)
>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))
>>> ax[0].scatter(X_spca[y==0, 0], X_spca[y==0, 1],
...               color='red', marker='^', alpha=0.5)
>>> ax[0].scatter(X_spca[y==1, 0], X_spca[y==1, 1],
...               color='blue', marker='o', alpha=0.5)
>>> ax[1].scatter(X_spca[y==0, 0], np.zeros((500, 1))+0.02,
...               color='red', marker='^', alpha=0.5)
>>> ax[1].scatter(X_spca[y==1, 0], np.zeros((500, 1))-0.02,
...               color='blue', marker='o', alpha=0.5)
>>> ax[0].set_xlabel('PC1')
>>> ax[0].set_ylabel('PC2')
>>> ax[1].set_ylim([-1, 1])
>>> ax[1].set_yticks([])
>>> ax[1].set_xlabel('PC1')
>>> plt.show()


Again, we can see that standard PCA is not able to produce results suitable for
training a linear classifier:


[Figure: concentric circles after standard PCA; left panel with axes PC1 and PC2, right panel with axis PC1]


Given an appropriate value for $\gamma$, let us see if we are luckier using the RBF kernel
PCA implementation:


>>> X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)
>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))
>>> ax[0].scatter(X_kpca[y==0, 0], X_kpca[y==0, 1],
...               color='red', marker='^', alpha=0.5)
>>> ax[0].scatter(X_kpca[y==1, 0], X_kpca[y==1, 1],
...               color='blue', marker='o', alpha=0.5)
>>> ax[1].scatter(X_kpca[y==0, 0], np.zeros((500, 1))+0.02,
...               color='red', marker='^', alpha=0.5)
>>> ax[1].scatter(X_kpca[y==1, 0], np.zeros((500, 1))-0.02,
...               color='blue', marker='o', alpha=0.5)
>>> ax[0].set_xlabel('PC1')
>>> ax[0].set_ylabel('PC2')
>>> ax[1].set_ylim([-1, 1])
>>> ax[1].set_yticks([])
>>> ax[1].set_xlabel('PC1')
>>> plt.show()


Again, the RBF kernel PCA projected the data onto a new subspace where the two 
classes become linearly separable:


Projecting new data points


In the two previous example applications of kernel PCA, the half-moon shapes and 
the concentric circles, we projected a single dataset onto a new feature. In real 
applications, however, we may have more than one dataset that we want to 
transform, for example, training and test data, and typically also new samples we 
will collect after the model building and evaluation. In this section, you will learn 
how to project data points that were not part of the training dataset. 


As we remember from the standard PCA approach at the beginning of this chapter,
we project data by calculating the dot product between a transformation matrix and
the input samples; the columns of the projection matrix are the top k eigenvectors ($v$)
that we obtained from the covariance matrix.
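As a reminder of what this looks like in code, here is a minimal NumPy sketch of the standard PCA projection. The toy data, the random seed, and the helper name pca_projection_matrix are assumptions made for illustration; the sketch also centers with the training mean, which is one common convention.

```python
import numpy as np

def pca_projection_matrix(X, k):
    """Return the training mean and the top-k eigenvectors of the covariance matrix."""
    mean = X.mean(axis=0)
    cov = np.cov((X - mean).T)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    return mean, eigvecs[:, order]

# Hypothetical training data with very different variances per feature.
rng = np.random.RandomState(0)
X_train_toy = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
mean, W = pca_projection_matrix(X_train_toy, k=2)

# Projecting a new sample only needs the stored mean and the matrix W.
x_new_toy = np.array([[1.0, -2.0, 0.3]])
x_new_pca = (x_new_toy - mean).dot(W)
print(x_new_pca)
```

Note that the training samples themselves are not needed once W has been computed; this is the property that kernel PCA does not share.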


Now, the question is how we can transfer this concept to kernel PCA. If we think
back to the idea behind kernel PCA, we remember that we obtained an eigenvector
($a$) of the centered kernel matrix (not the covariance matrix), which means that those
are the samples that are already projected onto the principal component axis $v$. Thus,
if we want to project a new sample $x'$ onto this principal component axis, we'd need
to compute the following:


$$\phi(x')^T v$$


Fortunately, we can use the kernel trick so that we don't have to calculate the
projection $\phi(x')^T v$ explicitly. However, it is worth noting that kernel
PCA, in contrast to standard PCA, is a memory-based method, which means that we
have to reuse the original training set each time to project new samples. We have to
calculate the pairwise RBF kernel (similarity) between each $i$th sample in the
training dataset and the new sample $x'$:


$$\phi(x')^T v = \sum_i a^{(i)} \phi(x')^T \phi(x^{(i)}) = \sum_i a^{(i)} \kappa(x', x^{(i)})$$


Here, the eigenvectors $a$ and eigenvalues $\lambda$ of the kernel matrix $K$ satisfy the
following condition in the equation:


$$K a = \lambda a$$
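A short derivation, given here only as a sketch in the same notation, shows where the eigenvalue enters. Reading the condition $K a = \lambda a$ row by row for the $i$th training sample (with $K$ the centered kernel matrix) gives:

```latex
\lambda \, a^{(i)} = \sum_{j} \kappa\left(x^{(i)}, x^{(j)}\right) a^{(j)}
\quad\Longrightarrow\quad
a^{(i)} = \frac{1}{\lambda} \sum_{j} \kappa\left(x^{(i)}, x^{(j)}\right) a^{(j)}
```

In other words, the stored projection $a^{(i)}$ of a training sample equals its kernel similarities to all training samples, weighted by the entries of $a$ and divided by $\lambda$. Applying the same weighted sum to a new sample $x'$ is what motivates dividing by the eigenvalue.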


After calculating the similarity between the new samples and the samples in the
training set, we have to normalize the eigenvector $a$ by its eigenvalue. Thus, let us
modify the rbf_kernel_pca function that we implemented earlier so that it also
returns the eigenvalues of the kernel matrix:


from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
import numpy as np


def rbf_kernel_pca(X, gamma, n_components):
    """
    RBF kernel PCA implementation.

    Parameters
    ------------
    X: {NumPy ndarray}, shape = [n_samples, n_features]

    gamma: float
      Tuning parameter of the RBF kernel

    n_components: int
      Number of principal components to return

    Returns
    ------------
    X_pc: {NumPy ndarray}, shape = [n_samples, k_features]
      Projected dataset

    lambdas: list
      Eigenvalues

    """
    # Calculate pairwise squared Euclidean distances
    # in the MxN dimensional dataset.
    sq_dists = pdist(X, 'sqeuclidean')

    # Convert pairwise distances into a square matrix.
    mat_sq_dists = squareform(sq_dists)

    # Compute the symmetric kernel matrix.
    # np.exp is used here because importing exp from the top-level
    # scipy namespace is deprecated in newer SciPy releases.
    K = np.exp(-gamma * mat_sq_dists)

    # Center the kernel matrix.
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

    # Obtaining eigenpairs from the centered kernel matrix
    # scipy.linalg.eigh returns them in ascending order
    eigvals, eigvecs = eigh(K)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Collect the top k eigenvectors (projected samples)
    alphas = np.column_stack([eigvecs[:, i]
                              for i in range(n_components)])

    # Collect the corresponding eigenvalues
    lambdas = [eigvals[i] for i in range(n_components)]

    return alphas, lambdas


Now, let's create a new half-moon dataset and project it onto a one-dimensional
subspace using the updated RBF kernel PCA implementation:


>>> X, y = make_moons(n_samples=100, random_state=123)
>>> alphas, lambdas = rbf_kernel_pca(X, gamma=15, n_components=1)


To make sure that we implemented the code for projecting new samples correctly, let us
assume that the 26th point from the half-moon dataset is a new data point $x'$, and
our task is to project it onto this new subspace:


>>> x_new = X[25]
>>> x_new
array([ 1.8713187 ,  0.00928245])
>>> x_proj = alphas[25] # original projection
>>> x_proj
array([ 0.07877284])
>>> def project_x(x_new, X, gamma, alphas, lambdas):
...     pair_dist = np.array([np.sum(
...                 (x_new-row)**2) for row in X])
...     k = np.exp(-gamma * pair_dist)
...     return k.dot(alphas / lambdas)


By executing the following code, we are able to reproduce the original projection.
Using the project_x function, we will be able to project any new data sample as
well. The code is as follows:


>>> x_reproj = project_x(x_new, X,
...                      gamma=15, alphas=alphas, lambdas=lambdas)
>>> x_reproj
array([ 0.07877284])
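In practice, new samples usually arrive in batches rather than one at a time. The following NumPy-only sketch applies the same formula to several rows at once; the helper names fit_rbf_kpca and project_many, the random toy data, and the seed are assumptions for illustration and are not part of the implementation above.

```python
import numpy as np

def fit_rbf_kpca(X, gamma, n_components):
    """Compact restatement of the kernel PCA steps from this section."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-gamma * sq_dists)
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)
    eigvals, eigvecs = np.linalg.eigh(K)          # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    return eigvecs[:, :n_components], eigvals[:n_components]

def project_many(X_new, X, gamma, alphas, lambdas):
    """Project several new samples at once (one row per sample)."""
    # Pairwise squared distances: rows = new samples, columns = training samples.
    sq_dists = np.sum((X_new[:, None, :] - X[None, :, :]) ** 2, axis=2)
    k = np.exp(-gamma * sq_dists)
    return k.dot(alphas / lambdas)

# Hypothetical toy data standing in for a training set and new samples.
rng = np.random.RandomState(123)
X_toy = rng.uniform(-1.0, 1.0, size=(40, 2))
alphas_toy, lambdas_toy = fit_rbf_kpca(X_toy, gamma=15, n_components=1)
X_new_toy = rng.uniform(-1.0, 1.0, size=(3, 2))
batch = project_many(X_new_toy, X_toy, 15, alphas_toy, lambdas_toy)
print(batch)
```

Each row of batch should match what a row-by-row loop over the same samples would return; the vectorized form merely avoids the Python-level loop over training samples.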


Lastly, let's visualize the projection on the first principal component: 


>>> plt.scatter(alphas[y==0, 0], np.zeros((50)),
...             color='red', marker='^', alpha=0.5)
>>> plt.scatter(alphas[y==1, 0], np.zeros((50)),
...             color='blue', marker='o', alpha=0.5)
>>> plt.scatter(x_proj, 0, color='black',
...             label='original projection of point X[25]',
...             marker='^', s=100)
>>> plt.scatter(x_reproj, 0, color='green',
...             label='remapped point X[25]',
...             marker='x', s=500)
>>> plt.legend(scatterpoints=1)
>>> plt.show()


As we can now also see in the following scatterplot, we mapped the sample $x'$ onto
the first principal component correctly:


[Figure: samples projected onto the first principal component, with legend entries "original projection of point X[25]" and "remapped point X[25]"]


Kernel principal component analysis in scikit-learn


For our convenience, scikit-learn implements a kernel PCA class in the
sklearn.decomposition submodule. The usage is similar to the standard PCA class,
and we can specify the kernel via the kernel parameter:


>>> from sklearn.decomposition import KernelPCA
>>> X, y = make_moons(n_samples=100, random_state=123)
>>> scikit_kpca = KernelPCA(n_components=2,
...                         kernel='rbf', gamma=15)
>>> X_skernpca = scikit_kpca.fit_transform(X)
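Besides fit_transform, the KernelPCA class also offers separate fit and transform methods, which is how the projection of new data points discussed in the previous section can be done with scikit-learn. The following sketch treats the last 20 half-moon samples as "new" data; this split and the variable names are assumptions chosen purely for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X_all, y_all = make_moons(n_samples=100, random_state=123)

# Hypothetical split: fit on the first 80 samples, project the remaining 20.
X_fit, X_unseen = X_all[:80], X_all[80:]

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
kpca.fit(X_fit)                            # stores what is needed from the fitted samples
X_unseen_kpca = kpca.transform(X_unseen)   # projects samples not seen during fit
print(X_unseen_kpca.shape)
```

Because kernel PCA is memory-based, the fitted object has to keep a reference to the samples it was fitted on in order to transform new ones.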


To check that we get results that are consistent with our own kernel PCA 
implementation, let's plot the transformed half-moon shape data onto the first two 
principal components: 


>>> plt.scatter(X_skernpca[y==0, 0], X_skernpca[y==0, 1],
...             color='red', marker='^', alpha=0.5)
>>> plt.scatter(X_skernpca[y==1, 0], X_skernpca[y==1, 1],
...             color='blue', marker='o', alpha=0.5)
>>> plt.xlabel('PC1')
>>> plt.ylabel('PC2')
>>> plt.show()


As we can see, the results of scikit-learn's KernelPCA are consistent with our own
implementation:


Note


The scikit-learn library also implements advanced techniques for nonlinear 
dimensionality reduction that are beyond the scope of this book. The interested 
reader can find a nice overview of the current implementations in scikit-learn, 
complemented by illustrative examples, at
http://scikit-learn.org/stable/modules/manifold.html.


Summary


In this chapter, you learned about three different, fundamental dimensionality 
reduction techniques for feature extraction: standard PCA, LDA, and kernel PCA. 
Using PCA, we projected data onto a lower-dimensional subspace to maximize the 
variance along the orthogonal feature axes, while ignoring the class labels. LDA, in 
contrast to PCA, is a technique for supervised dimensionality reduction, which
means that it considers class information in the training dataset to attempt to
maximize the class-separability in a linear feature space.


Lastly, you learned about a nonlinear feature extractor, kernel PCA. Using the kernel 
trick and a temporary projection into a higher-dimensional feature space, you were 
ultimately able to compress datasets consisting of nonlinear features onto a lower- 
dimensional subspace where the classes became linearly separable. 


Equipped with these essential preprocessing techniques, you are now well prepared 
to learn about the best practices for efficiently incorporating different preprocessing 
techniques and evaluating the performance of different models in the next chapter.


Chapter 6. Learning Best Practices for Model Evaluation and Hyperparameter Tuning


In the previous chapters, you learned about the essential machine learning algorithms 
for classification and how to get our data into shape before we feed it into those 
algorithms. Now, it's time to learn about the best practices of building good machine 
learning models by fine-tuning the algorithms and evaluating the model's 
performance! In this chapter, we will learn how to do the following: 


• Obtain unbiased estimates of a model's performance
• Diagnose the common problems of machine learning algorithms
• Fine-tune machine learning models
• Evaluate predictive models using different performance metrics


Streamlining workflows with pipelines


When we applied different preprocessing techniques in the previous chapters, such 
as standardization for feature scaling in Chapter 4, Building Good Training Sets — 
Data Preprocessing, or principal component analysis for data compression in
Chapter 5, Compressing Data via Dimensionality Reduction, you learned that we 
have to reuse the parameters that were obtained during the fitting of the training data 
to scale and compress any new data, such as the samples in the separate test dataset. 
In this section, you will learn about an extremely handy tool, the Pipeline class in 
scikit-learn. It allows us to fit a model including an arbitrary number of 
transformation steps and apply it to make predictions about new data.


Loading the Breast Cancer Wisconsin dataset


In this chapter, we will be working with the Breast Cancer Wisconsin dataset, which 
contains 569 samples of malignant and benign tumor cells. The first two columns in 
the dataset store the unique ID numbers of the samples and the corresponding 
diagnoses (M = malignant, B = benign), respectively. Columns 3-32 contain 30 real-valued
features that have been computed from digitized images of the cell nuclei,
which can be used to build a model to predict whether a tumor is benign or
malignant. The Breast Cancer Wisconsin dataset has been deposited in the UCI
Machine Learning Repository, and more detailed information about this dataset can
be found at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).


Note 


You can find a copy of the breast cancer dataset (and all other datasets used in this
book) in the code bundle of this book, which you can use if you are working offline
or if the UCI server at
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
is temporarily unavailable. For instance, to load the breast cancer dataset from a local
directory, you can take the following lines:


df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                 'machine-learning-databases'
                 '/breast-cancer-wisconsin/wdbc.data',
                 header=None)


Replace the preceding lines with this:


df = pd.read_csv('your/local/path/to/wdbc.data',
                 header=None)


In this section, we will read in the dataset and split it into training and test datasets in 
three simple steps: 


1. We will start by reading in the dataset directly from the UCI website using
pandas:


>>> import pandas as pd
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/'
...                  'machine-learning-databases'
...                  '/breast-cancer-wisconsin/wdbc.data',
...                  header=None)


2. Next, we assign the 30 features to a NumPy array X. Using a LabelEncoder
object, we transform the class labels from their original string representation
('M' and 'B') into integers:


>>> from sklearn.preprocessing import LabelEncoder
>>> X = df.loc[:, 2:].values
>>> y = df.loc[:, 1].values
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)
>>> le.classes_
array(['B', 'M'], dtype=object)


After encoding the class labels (diagnosis) in an array y, the malignant tumors 
are now represented as class 1, and the benign tumors are represented as class 0, 
respectively. We can double-check this mapping by calling the transform 
method of the fitted LabelEncoder on two dummy class labels: 


>>> le.transform(['M', 'B'])
array([1, 0]) 


3. Before we construct our first model pipeline in the following subsection, let us 
divide the dataset into a separate training dataset (80 percent of the data) and a 
separate test dataset (20 percent of the data): 


>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = \
...     train_test_split(X, y,
...                      test_size=0.20,
...                      stratify=y,
...                      random_state=1)


Combining transformers and estimators in a pipeline


In the previous chapter, you learned that many learning algorithms require input 
features on the same scale for optimal performance. Thus, we need to standardize the 
columns in the Breast Cancer Wisconsin dataset before we can feed them to a linear 
classifier, such as logistic regression. Furthermore, let's assume that we want to 
compress our data from the initial 30 dimensions onto a lower two-dimensional 
subspace via Principal Component Analysis (PCA), a feature extraction technique 
for dimensionality reduction that we introduced in Chapter 5, Compressing Data via 
Dimensionality Reduction. 


Instead of going through the fitting and transformation steps for the training and test 
datasets separately, we can chain the StandardScaler, PCA, and 
LogisticRegression objects in a pipeline: 
>>> from sklearn.preprocessing import StandardScaler 
>>> from sklearn.decomposition import PCA 
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import make_pipeline
>>> pipe_lr = make_pipeline(StandardScaler(),
...                         PCA(n_components=2),
...                         LogisticRegression(random_state=1))
>>> pipe_lr.fit(X_train, y_train)
>>> y_pred = pipe_lr.predict(X_test)
>>> print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
Test Accuracy: 0.956 


The make_pipeline function takes an arbitrary number of scikit-learn transformers
(objects that support the fit and transform methods) as input, followed by a
scikit-learn estimator that implements the fit and predict methods. In our preceding code
example, we provided two transformers, StandardScaler and PCA, and a
LogisticRegression estimator as inputs to the make_pipeline function, which
constructs a scikit-learn Pipeline object from these objects.


We can think of a scikit-learn Pipeline as a meta-estimator or wrapper around those 
individual transformers and estimators. If we call the fit method of Pipeline, the 
data will be passed down a series of transformers via fit and transform calls on 
these intermediate steps until it arrives at the estimator object (the final element in a 
pipeline). The estimator will then be fitted to the transformed training data.


When we executed the fit method on the pipe_lr pipeline in the preceding code
example, StandardScaler first performed fit and transform calls on the training
data. Second, the transformed training data was passed on to the next object in the
pipeline, PCA. Similar to the previous step, PCA also executed fit and transform on
the scaled input data and passed it to the final element of the pipeline, the estimator.


Finally, the LogisticRegression estimator was fit to the training data after it
underwent transformations via StandardScaler and PCA. Again, we should note that
there is no limit to the number of intermediate steps in a pipeline; however, the last
pipeline element has to be an estimator.


Similar to calling fit on a pipeline, pipelines also implement a predict method. If
we feed a dataset to the predict call of a Pipeline object instance, the data will pass
through the intermediate steps via transform calls. In the final step, the estimator
object will then return a prediction on the transformed data.
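To make the fit and predict behavior described above concrete, here is a deliberately simplified, NumPy-only sketch of the mechanism. The classes Standardize, NearestCentroid, and MiniPipeline are hypothetical stand-ins written for this illustration; they are not scikit-learn classes, and the real Pipeline does considerably more (parameter handling, cloning, caching).

```python
import numpy as np

class Standardize:
    """Toy transformer: learns column means and standard deviations in fit."""
    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

class NearestCentroid:
    """Toy estimator: predicts the class whose centroid is closest."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

class MiniPipeline:
    """All steps but the last are transformers; the last one is the estimator."""
    def __init__(self, *steps):
        self.transformers = steps[:-1]
        self.estimator = steps[-1]

    def fit(self, X, y):
        for t in self.transformers:
            X = t.fit(X, y).transform(X)   # fit AND transform on training data
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        for t in self.transformers:
            X = t.transform(X)             # transform only: reuse fitted parameters
        return self.estimator.predict(X)

X_tr = np.array([[0., 10.], [1., 12.], [9., 30.], [10., 32.]])
y_tr = np.array([0, 0, 1, 1])
X_te = np.array([[0.5, 11.], [9.5, 31.]])

pipe = MiniPipeline(Standardize(), NearestCentroid()).fit(X_tr, y_tr)
pred = pipe.predict(X_te)
print(pred)
```

The key design point is visible in predict: the test data is scaled with the mean and standard deviation learned from the training data, never with statistics of its own.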


The pipelines of the scikit-learn library are immensely useful wrapper tools, which we
will use frequently throughout the rest of this book. To make sure that you've got a
good grasp of how the Pipeline object works, please take a close look at the following
illustration, which summarizes our discussion from the previous paragraphs:


[Figure: Pipeline workflow. pipeline.fit(...) passes the training set through the intermediate steps, including dimensionality reduction, via fit(...) and transform(...) calls to the learning algorithm, whose fit(...) call produces the predictive model; pipeline.predict(...) passes data through the same steps via transform(...) calls, and the model's predict(...) call returns the class labels.]


Using k-fold cross-validation to assess model performance


One of the key steps in building a machine learning model is to estimate its
performance on data that the model hasn't seen before. Let's assume that we fit our
model on a training dataset and use the same data to estimate how well it performs
on new data. We remember from the Tackling overfitting via regularization section
in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, that a
model can either suffer from underfitting (high bias) if the model is too simple, or it
can overfit the training data (high variance) if the model is too complex for the
underlying training data.


To find an acceptable bias-variance trade-off, we need to evaluate our model 
carefully. In this section, you will learn about the common cross-validation 
techniques holdout cross-validation and k-fold cross-validation, which can help us 
obtain reliable estimates of the model's generalization performance, that is, how well 
the model performs on unseen data.


The holdout method


A classic and popular approach for estimating the generalization performance of
machine learning models is holdout cross-validation. Using the holdout method, we
split our initial dataset into separate training and test datasets: the former is used
for model training, and the latter is used to estimate its generalization performance.
However, in typical machine learning applications, we are also interested in tuning
and comparing different parameter settings to further improve the performance for
making predictions on unseen data. This process is called model selection, where the
term model selection refers to a given classification problem for which we want to
select the optimal values of tuning parameters (also called hyperparameters).
However, if we reuse the same test dataset over and over again during model
selection, it will become part of our training data and thus the model will be more
likely to overfit. Despite this issue, many people still use the test set for model
selection, which is not a good machine learning practice.


A better way of using the holdout method for model selection is to separate the data 
into three parts: a training set, a validation set, and a test set. The training set is used 
to fit the different models, and the performance on the validation set is then used for
the model selection. The advantage of having a test set that the model hasn't seen 
before during the training and model selection steps is that we can obtain a less 
biased estimate of its ability to generalize to new data. The following figure 
illustrates the concept of holdout cross-validation, where we use a validation set to 
repeatedly evaluate the performance of the model after training using different 
parameter values. Once we are satisfied with the tuning of hyperparameter values, 
we estimate the models' generalization performance on the test dataset:


[Figure: Holdout method with a validation set. Starting from the original dataset, a machine learning algorithm is fit, the resulting predictive model is evaluated, the hyperparameters are changed and the process is repeated; afterwards, a final performance estimate is obtained.]
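The three-way split described above can be sketched with a shuffled index array. The dataset size, the seed, and the 60/20/20 proportions below are assumptions chosen for illustration; the text does not prescribe particular proportions.

```python
import numpy as np

rng = np.random.RandomState(1)
n_samples = 100                        # hypothetical dataset size
indices = rng.permutation(n_samples)   # shuffle once, then cut into three parts

n_test = int(0.2 * n_samples)          # assumed 20 percent test set
n_valid = int(0.2 * n_samples)         # assumed 20 percent validation set

test_idx = indices[:n_test]
valid_idx = indices[n_test:n_test + n_valid]
train_idx = indices[n_test + n_valid:]

print(len(train_idx), len(valid_idx), len(test_idx))
```

The training indices would be used to fit each candidate model, the validation indices to compare hyperparameter settings, and the test indices only once at the very end.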


A disadvantage of the holdout method is that the performance estimate may be very
sensitive to how we partition the training set into the training and validation subsets;
the estimate will vary for different samples of the data. In the next subsection, we
will take a look at a more robust technique for performance estimation, k-fold
cross-validation, where we repeat the holdout method k times on k subsets of the training
data.


K-fold cross-validation


In k-fold cross-validation, we randomly split the training dataset into k folds without
replacement, where k - 1 folds are used for the model training, and one fold is used
for performance evaluation. This procedure is repeated k times so that we obtain k
models and performance estimates.


Note 


We looked at an example to illustrate sampling with and without replacement in 
Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn. If you haven't 
read that chapter, or want a refresher, refer to the information box in the Combining 
multiple decision trees via random forests section in Chapter 3, A Tour of Machine 
Learning Classifiers Using scikit-learn. 


We then calculate the average performance of the models based on the different,
independent folds to obtain a performance estimate that is less sensitive to the
sub-partitioning of the training data compared to the holdout method. Typically, we use
k-fold cross-validation for model tuning, that is, finding the optimal hyperparameter
values that yield a satisfying generalization performance.
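The procedure can be sketched in a few lines of NumPy. The generator name kfold_indices, the toy labels, and the trivial majority-class "model" are assumptions used only to keep the example self-contained; any estimator with fit and score methods could take its place.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train, test) index arrays for k folds drawn without replacement."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Hypothetical labels: 30 samples of class 0 and 20 samples of class 1.
y_toy = np.array([0] * 30 + [1] * 20)

scores = []
for train, test in kfold_indices(len(y_toy), k=5):
    majority = np.bincount(y_toy[train]).argmax()       # "fit" on k - 1 folds
    scores.append(np.mean(y_toy[test] == majority))     # "score" on the held-out fold

print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
```

Every sample lands in exactly one test fold, and the k individual scores are averaged into a single estimate.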


Once we have found satisfactory hyperparameter values, we can retrain the model on 
the complete training set and obtain a final performance estimate using the 
independent test set. The rationale behind fitting a model to the whole training 
dataset after k-fold cross-validation is that providing more training samples to a
learning algorithm usually results in a more accurate and robust model. 


Since k-fold cross-validation is a resampling technique without replacement, the
advantage of this approach is that each sample point will be used for validation
(as part of a test fold) exactly once and for training in the remaining iterations,
which yields a lower-variance estimate of the model performance than the holdout
method. The following figure summarizes the concept behind k-fold cross-validation
with k = 10. The training dataset is divided into 10 folds, and during the 10
iterations, nine folds are used for training, and one fold will be used as the test
set for the model evaluation. Also, the estimated performances $E_i$ (for example,
classification accuracy or error) for each fold are then used to calculate the
estimated average performance $E$ of the model:


[Figure: k-fold cross-validation with k = 10. In each of the 10 iterations, nine training folds and one test fold yield a performance estimate $E_i$; the estimates are averaged, $E = \frac{1}{10}\sum_{i=1}^{10} E_i$.]





A good standard value for k in k-fold cross-validation is 10, as empirical evidence
shows. For instance, experiments by Ron Kohavi on various real-world datasets 
suggest that 10-fold cross-validation offers the best trade-off between bias and 
variance (A Study of Cross-Validation and Bootstrap for Accuracy Estimation and 
Model Selection, Kohavi, Ron, International Joint Conference on Artificial 
Intelligence (IJCAI), 14 (12): 1137-43, 1995). 


However, if we are working with relatively small training sets, it can be useful to 
increase the number of folds. If we increase the value of k, more training data will be
used in each iteration, which results in a lower pessimistic bias in the estimate of the
generalization performance obtained by averaging the individual model estimates. However,
large values of k will also increase the runtime of the cross-validation algorithm and
yield estimates with higher variance, since the training folds will be more similar to
each other. On the other hand, if we are working with large datasets, we can choose a 
smaller value for k, for example, k = 5, and still obtain an accurate estimate of the 
average performance of the model while reducing the computational cost of refitting 
and evaluating the model on the different folds. 


Note


A special case of k-fold cross-validation is the leave-one-out cross-validation
(LOOCV) method. In LOOCV, we set the number of folds equal to the number of
training samples (k = n) so that only one training sample is used for testing during
each iteration, which is a recommended approach for working with very small
datasets.


A slight improvement over the standard k-fold cross-validation approach is stratified
k-fold cross-validation, which can yield better bias and variance estimates, especially
in cases of unequal class proportions, as has been shown in a study by Ron Kohavi
(A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model
Selection, International Joint Conference on Artificial Intelligence (IJCAI), 14 (12):
1137-43, 1995). In stratified cross-validation, the class proportions are preserved in
each fold to ensure that each fold is representative of the class proportions in the
training dataset, which we will illustrate by using the StratifiedKFold iterator in
scikit-learn:


>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> # random_state is left at its default here: without shuffle=True it
>>> # has no effect, and newer scikit-learn versions reject the combination
>>> kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)
>>> scores = []
>>> for k, (train, test) in enumerate(kfold):
...     pipe_lr.fit(X_train[train], y_train[train])
...     score = pipe_lr.score(X_train[test], y_train[test])
...     scores.append(score)
...     print('Fold: %2d, Class dist.: %s, Acc: %.3f' % (k+1,
...           np.bincount(y_train[train]), score))
Fold:  1, Class dist.: [256 153], Acc: 0.935
Fold:  2, Class dist.: [256 153], Acc: 0.935
Fold:  3, Class dist.: [256 153], Acc: 0.957
Fold:  4, Class dist.: [256 153], Acc: 0.957
Fold:  5, Class dist.: [256 153], Acc: 0.935
Fold:  6, Class dist.: [257 153], Acc: 0.956
Fold:  7, Class dist.: [257 153], Acc: 0.978
Fold:  8, Class dist.: [257 153], Acc: 0.933
Fold:  9, Class dist.: [257 153], Acc: 0.956
Fold: 10, Class dist.: [257 153], Acc: 0.956
>>> print('\nCV accuracy: %.3f +/- %.3f' %
...       (np.mean(scores), np.std(scores)))
CV accuracy: 0.950 +/- 0.014


First, we initialized the StratifiedKFold iterator from the
sklearn.model_selection module with the y_train class labels in the training set,
and we specified the number of folds via the n_splits parameter. When we used the
kfold iterator to loop through the k folds, we used the returned indices in train to fit
the logistic regression pipeline that we set up at the beginning of this chapter. Using
the pipe_lr pipeline, we ensured that the samples were scaled properly (for instance,
standardized) in each iteration. We then used the test indices to calculate the
accuracy score of the model, which we collected in the scores list to calculate the
average accuracy and the standard deviation of the estimate.
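To see how stratification can keep the class proportions stable across folds, consider the following simplified NumPy sketch. It deals the samples of each class to the folds in turn; this is an illustrative scheme under stated assumptions and not necessarily how scikit-learn's StratifiedKFold assigns samples internally. The function name stratified_fold_ids is hypothetical, and the class counts (285 and 170) are inferred from the class distributions printed above.

```python
import numpy as np

def stratified_fold_ids(y, k, seed=0):
    """Assign each sample a fold id so that every class is spread evenly."""
    rng = np.random.RandomState(seed)
    fold_ids = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        members = rng.permutation(np.where(y == label)[0])
        # Deal the members of this class to folds 0, 1, ..., k-1 in turn.
        fold_ids[members] = np.arange(len(members)) % k
    return fold_ids

# Labels with the class counts inferred from the output above (an assumption).
y_toy = np.array([0] * 285 + [1] * 170)
fold_ids = stratified_fold_ids(y_toy, k=10)

for f in range(10):
    train_mask = fold_ids != f
    print('Fold: %2d, Class dist.: %s' % (f + 1, np.bincount(y_toy[train_mask])))
```

Because each class is distributed separately, the per-fold class counts can differ by at most one sample, so every training portion keeps nearly the same class ratio as the full training set.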


Although the previous code example was useful to illustrate how k-fold cross- 
validation works, scikit-learn also implements a k-fold cross-validation scorer, which 
allows us to evaluate our model using stratified k-fold cross-validation less 
verbosely: 


>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(estimator=pipe_lr,
...                          X=X_train,
...                          y=y_train,
...                          cv=10,
...                          n_jobs=1)
>>> print('CV accuracy scores: %s' % scores)
CV accuracy scores: [ 0.93478261  0.93478261  0.95652174
                      0.95652174  0.93478261  0.95555556
                      0.97777778  0.93333333  0.95555556
                      0.95555556]
>>> print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores),
...       np.std(scores)))
CV accuracy: 0.950 +/- 0.014


An extremely useful feature of the cross_val_score approach is that we can
distribute the evaluation of the different folds across multiple CPUs on our machine.
If we set the n_jobs parameter to 1, only one CPU will be used to evaluate the
performances, just like in our StratifiedKFold example previously. However, by
setting n_jobs=2, we could distribute the 10 rounds of cross-validation to two CPUs
(if available on our machine), and by setting n_jobs=-1, we can use all available
CPUs on our machine to do the computation in parallel.
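The n_jobs setting only changes how the work is distributed, not which folds are evaluated. The following sketch checks that cross_val_score with an integer cv reproduces a manual StratifiedKFold loop for a classifier. It assumes a scikit-learn version in which an integer cv uses stratified folds for classifiers, and it loads the copy of the breast cancer data bundled with scikit-learn via load_breast_cancer (whose label encoding may differ from the one used in this chapter) so that the example runs offline; the variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     LogisticRegression(random_state=1))

# Compact form: scikit-learn handles the splitting, fitting, and scoring.
auto_scores = cross_val_score(estimator=pipe, X=X_bc, y=y_bc, cv=10, n_jobs=1)

# Verbose form: the same folds evaluated by hand.
manual_scores = []
for train, test in StratifiedKFold(n_splits=10).split(X_bc, y_bc):
    pipe.fit(X_bc[train], y_bc[train])
    manual_scores.append(pipe.score(X_bc[test], y_bc[test]))

print('CV accuracy: %.3f +/- %.3f' % (np.mean(auto_scores), np.std(auto_scores)))
```

Since both variants evaluate the same folds with the same deterministic pipeline, the two score lists are expected to agree; raising n_jobs should only shorten the wall-clock time on a multi-core machine.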


Note 


Please note that a detailed discussion of how the variance of the generalization
performance is estimated in cross-validation is beyond the scope of this book, but I
have written a series of articles about model evaluation and cross-validation that
discuss these topics in more depth. These articles are available here:


• https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
• https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html
• https://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html


In addition, you can find a detailed discussion in this excellent article by M. 
Markatou and others (Analysis of Variance of Cross-validation Estimators of the 
Generalization Error, M. Markatou, H. Tian, S. Biswas, and G. M. Hripcsak, 
Journal of Machine Learning Research, 6: 1127-1168, 2005). 


You can also read about alternative cross-validation techniques, such as the .632
Bootstrap cross-validation method (Improvements on Cross-validation: The .632+
Bootstrap Method, B. Efron and R. Tibshirani, Journal of the American Statistical
Association, 92(438): 548-560, 1997).

Debugging algorithms with learning and validation curves


In this section, we will take a look at two very simple yet powerful diagnostic tools 
that can help us improve the performance of a learning algorithm: learning curves 
and validation curves. In the next subsections, we will discuss how we can use 
learning curves to diagnose whether a learning algorithm has a problem with 
overfitting (high variance) or underfitting (high bias). Furthermore, we will take a 
look at validation curves that can help us address the common issues of a learning 
algorithm. 

Diagnosing bias and variance problems with learning curves


If a model is too complex for a given training dataset—there are too many degrees of 
freedom or parameters in this model—the model tends to overfit the training data 
and does not generalize well to unseen data. Often, it can help to collect more 
training samples to reduce the degree of overfitting. However, in practice, it can
often be very expensive or simply not feasible to collect more data. By plotting the 
model training and validation accuracies as functions of the training set size, we can 
easily detect whether the model suffers from high variance or high bias, and whether 
the collection of more data could help address this problem. But before we discuss 
how to plot learning curves in scikit-learn, let's discuss those two common model 
issues by walking through the following illustration: 

The graph in the upper-left shows a model with high bias. This model has both low
training and cross-validation accuracy, which indicates that it underfits the training 
data. Common ways to address this issue are to increase the number of parameters of 
the model, for example, by collecting or constructing additional features, or by 
decreasing the degree of regularization, for example, in SVM or logistic regression 
classifiers. 


The graph in the upper-right shows a model that suffers from high variance, which is 
indicated by the large gap between the training and cross-validation accuracy. To 
address this problem of overfitting, we can collect more training data, reduce the 
complexity of the model, or increase the regularization parameter, for example. For
unregularized models, it can also help to decrease the number of features via feature
selection (Chapter 4, Building Good Training Sets – Data Preprocessing) or feature
extraction (Chapter 5, Compressing Data via Dimensionality Reduction) to decrease
the degree of overfitting. While collecting more training data usually tends to
decrease the chance of overfitting, it may not always help, for example, if the
training data is extremely noisy or the model is already very close to optimal.
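

The diagnostic reasoning above can be condensed into a small helper. The following is only a minimal sketch and not part of scikit-learn: the function name diagnose_fit and the two thresholds (the desired accuracy and the largest tolerated gap between training and validation accuracy) are hypothetical values chosen for illustration and would have to be adapted to the problem at hand:

```python
def diagnose_fit(train_acc, valid_acc, target_acc=0.90, max_gap=0.05):
    """Roughly label a model based on its training and validation accuracy.

    target_acc and max_gap are illustrative assumptions, not universal constants.
    """
    gap = train_acc - valid_acc
    if train_acc < target_acc and valid_acc < target_acc:
        # Both accuracies are low: the model underfits (high bias).
        return 'high bias'
    if gap > max_gap:
        # Training accuracy is much higher than validation accuracy:
        # the model overfits (high variance).
        return 'high variance'
    return 'good bias-variance trade-off'


print(diagnose_fit(train_acc=0.80, valid_acc=0.78))
print(diagnose_fit(train_acc=0.99, valid_acc=0.88))
print(diagnose_fit(train_acc=0.96, valid_acc=0.95))
```

In practice, such a rule of thumb is most informative when it is applied to the accuracies at several training set sizes, which is exactly what a learning curve shows.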


In the next subsection, we will see how to address those model issues using 
validation curves, but let's first see how we can use the learning_curve function from
scikit-learn to evaluate the model:


>>> import matplotlib.pyplot as plt
>>> from sklearn.model_selection import learning_curve
>>> pipe_lr = make_pipeline(StandardScaler(),
...                         LogisticRegression(penalty='l2',
...                                            random_state=1))
>>> train_sizes, train_scores, test_scores =\
...                 learning_curve(estimator=pipe_lr,
...                                X=X_train,
...                                y=y_train,
...                                train_sizes=np.linspace(
...                                            0.1, 1.0, 10),
...                                cv=10,
...                                n_jobs=1)
>>> train_mean = np.mean(train_scores, axis=1)
>>> train_std = np.std(train_scores, axis=1)
>>> test_mean = np.mean(test_scores, axis=1)
>>> test_std = np.std(test_scores, axis=1)
>>> plt.plot(train_sizes, train_mean,
...          color='blue', marker='o',
...          markersize=5, label='training accuracy')
>>> plt.fill_between(train_sizes,
...                  train_mean + train_std,
...                  train_mean - train_std,
...                  alpha=0.15, color='blue')
>>> plt.plot(train_sizes, test_mean,
...          color='green', linestyle='--',
...          marker='s', markersize=5,
...          label='validation accuracy')
>>> plt.fill_between(train_sizes,
...                  test_mean + test_std,
...                  test_mean - test_std,
...                  alpha=0.15, color='green')
>>> plt.grid()
>>> plt.xlabel('Number of training samples')
>>> plt.ylabel('Accuracy')
>>> plt.legend(loc='lower right')
>>> plt.ylim([0.8, 1.0])
>>> plt.show()


After we have successfully executed the preceding code, we obtain the following 
learning curve plot: 


[Figure: learning curve showing the training accuracy and the validation accuracy
(y-axis: Accuracy) as functions of the number of training samples (x-axis)]


Via the train_sizes parameter in the learning_curve function, we can control the
absolute or relative number of training samples that are used to generate the learning
curves. Here, we set train_sizes=np.linspace(0.1, 1.0, 10) to use 10 evenly
spaced, relative intervals for the training set sizes. By default, the learning_curve
function uses stratified k-fold cross-validation to calculate the cross-validation
accuracy of a classifier, and we set k=10 via the cv parameter for 10-fold stratified
cross-validation. Then, we simply calculated the average accuracies from the
returned cross-validated training and test scores for the different sizes of the training
set, which we plotted using Matplotlib's plot function. Furthermore, we added the
standard deviation of the average accuracy to the plot using the fill_between
function to indicate the variance of the estimate.
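

To make the relative training set sizes more tangible, the following sketch converts the ten fractions into approximate absolute sample counts. It rests on two assumptions that are not stated in the text: that the training set holds 455 samples (the 569 samples of the dataset minus the 114 test samples), and that fractional sizes are truncated to integers. The exact counts used internally by scikit-learn may differ by a sample depending on how the folds are formed:

```python
n_train = 569 - 114   # training samples left after holding out 114 test samples
n_folds = 10

# Assumption: in 10-fold cross-validation, the largest validation fold
# holds 46 samples, which leaves 409 samples for fitting in each round.
n_per_split = n_train - (n_train // n_folds + 1)

# The fractions 0.1, 0.2, ..., 1.0 used for train_sizes
fractions = [i / 10 for i in range(1, 11)]

# Assumption: fractional sample counts are truncated to integers.
abs_sizes = [int(f * n_per_split) for f in fractions]
print(abs_sizes)
```

Under these assumptions, the smallest training subset contains about 40 samples and the largest about 409, which matches the range of the x-axis in a learning curve plot for this dataset.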


As we can see in the preceding learning curve plot, our model performs quite well on
both the training and validation datasets if it has seen more than 250 samples during
training. We can also see that the training accuracy increases for training sets with
fewer than 250 samples, and the gap between validation and training accuracy
widens, an indicator of an increasing degree of overfitting.

Addressing over- and underfitting with validation curves


Validation curves are a useful tool for improving the performance of a model by
addressing issues such as overfitting or underfitting. Validation curves are related to
learning curves, but instead of plotting the training and test accuracies as functions of
the sample size, we vary the values of the model parameters, for example, the
inverse regularization parameter C in logistic regression. Let's go ahead and see how
we create validation curves via scikit-learn:


>>> from sklearn.model_selection import validation_curve
>>> param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
>>> train_scores, test_scores = validation_curve(
...                 estimator=pipe_lr,
...                 X=X_train,
...                 y=y_train,
...                 param_name='logisticregression__C',
...                 param_range=param_range,
...                 cv=10)
>>> train_mean = np.mean(train_scores, axis=1)
>>> train_std = np.std(train_scores, axis=1)
>>> test_mean = np.mean(test_scores, axis=1)
>>> test_std = np.std(test_scores, axis=1)
>>> plt.plot(param_range, train_mean,
...          color='blue', marker='o',
...          markersize=5, label='training accuracy')
>>> plt.fill_between(param_range, train_mean + train_std,
...                  train_mean - train_std, alpha=0.15,
...                  color='blue')
>>> plt.plot(param_range, test_mean,
...          color='green', linestyle='--',
...          marker='s', markersize=5,
...          label='validation accuracy')
>>> plt.fill_between(param_range,
...                  test_mean + test_std,
...                  test_mean - test_std,
...                  alpha=0.15, color='green')
>>> plt.grid()
>>> plt.xscale('log')
>>> plt.legend(loc='lower right')
>>> plt.xlabel('Parameter C')
>>> plt.ylabel('Accuracy')
>>> plt.ylim([0.8, 1.0])
>>> plt.show()


Using the preceding code, we obtained the validation curve plot for the parameter C.


Similar to the learning_curve function, the validation_curve function uses
stratified k-fold cross-validation by default to estimate the performance of the
classifier. Inside the validation_curve function, we specified the parameter that we
wanted to evaluate. In this case, it is C, the inverse regularization parameter of the
LogisticRegression classifier, which we wrote as 'logisticregression__C' to
access the LogisticRegression object inside the scikit-learn pipeline for a specified
value range that we set via the param_range parameter. Similar to the learning curve
example in the previous section, we plotted the average training and cross-validation
accuracies and the corresponding standard deviations.


Although the differences in the accuracy for varying values of C are subtle, we can
see that the model slightly underfits the data when we increase the regularization
strength (small values of C). Large values of C, however, lower the strength of
regularization, so the model tends to slightly overfit the data. In this case,
the sweet spot appears to be a value of C between 0.01 and 0.1.

Fine-tuning machine learning models via grid search


In machine learning, we have two types of parameters: those that are learned from 
the training data, for example, the weights in logistic regression, and the parameters 
of a learning algorithm that are optimized separately. The latter are the tuning 
parameters, also called hyperparameters, of a model, for example, the 
regularization parameter in logistic regression or the depth parameter of a decision 
tree. 


In the previous section, we used validation curves to improve the performance of a 
model by tuning one of its hyperparameters. In this section, we will take a look at a 
popular hyperparameter optimization technique called grid search that can further 
help improve the performance of a model by finding the optimal combination of 
hyperparameter values. 

Tuning hyperparameters via grid search


The approach of grid search is quite simple; it's a brute-force exhaustive search
paradigm where we specify a list of values for different hyperparameters, and the 
computer evaluates the model performance for each combination of those to obtain 
the optimal combination of values from this set: 


>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import SVC
>>> pipe_svc = make_pipeline(StandardScaler(),
...                          SVC(random_state=1))
>>> param_range = [0.0001, 0.001, 0.01, 0.1,
...                1.0, 10.0, 100.0, 1000.0]
>>> param_grid = [{'svc__C': param_range,
...                'svc__kernel': ['linear']},
...               {'svc__C': param_range,
...                'svc__gamma': param_range,
...                'svc__kernel': ['rbf']}]
>>> gs = GridSearchCV(estimator=pipe_svc,
...                   param_grid=param_grid,
...                   scoring='accuracy',
...                   cv=10,
...                   n_jobs=-1)
>>> gs = gs.fit(X_train, y_train)
>>> print(gs.best_score_)
0.984615384615
>>> print(gs.best_params_)
{'svc__C': 100.0, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}


Using the preceding code, we initialized a GridSearchCV object from the
sklearn.model_selection module to train and tune a Support Vector Machine
(SVM) pipeline. We set the param_grid parameter of GridSearchCV to a list of
dictionaries to specify the parameters that we'd want to tune. For the linear SVM, we
only evaluated the inverse regularization parameter C; for the RBF kernel SVM, we
tuned both the svc__C and svc__gamma parameters. Note that the svc__gamma
parameter is specific to kernel SVMs.
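

To get a feeling for how quickly an exhaustive search grows, the following sketch enumerates every candidate setting described by a list of parameter dictionaries like the one used for the SVM pipeline. It only uses the Python standard library; the helper name expand_grid is hypothetical and merely mimics the way such a grid is expanded, it is not a scikit-learn function:

```python
from itertools import product

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range,
               'svc__kernel': ['linear']},
              {'svc__C': param_range,
               'svc__gamma': param_range,
               'svc__kernel': ['rbf']}]


def expand_grid(grid):
    """Yield every parameter combination described by a list of dictionaries."""
    for sub_grid in grid:
        names = sorted(sub_grid)
        # Cartesian product of all value lists in this dictionary
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))


candidates = list(expand_grid(param_grid))
n_folds = 10
print(len(candidates))            # 8 linear settings + 8 * 8 RBF settings
print(len(candidates) * n_folds)  # model fits needed with 10-fold cross-validation
```

Each additional hyperparameter multiplies the number of candidates, which is why the grid is kept deliberately coarse here.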


After we used the training data to perform the grid search, we obtained the score of
the best-performing model via the best_score_ attribute and looked at its
parameters that can be accessed via the best_params_ attribute. In this particular
case, the RBF-kernel SVM model with svc__C = 100.0 yielded the best k-fold
cross-validation accuracy: 98.5 percent.


Finally, we will use the independent test dataset to estimate the performance of the
best-selected model, which is available via the best_estimator_ attribute of the
GridSearchCV object:


>>> clf = gs.best_estimator_
>>> clf.fit(X_train, y_train)
>>> print('Test accuracy: %.3f' % clf.score(X_test, y_test))
Test accuracy: 0.974


Note 


Although grid search is a powerful approach for finding the optimal set of
parameters, the evaluation of all possible parameter combinations is also
computationally very expensive. An alternative approach to sampling different
parameter combinations using scikit-learn is randomized search. Using the
RandomizedSearchCV class in scikit-learn, we can draw random parameter
combinations from sampling distributions with a specified budget. More details and
examples of its usage can be found at
http://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization.

Algorithm selection with nested cross-validation


Using k-fold cross-validation in combination with grid search is a useful approach 
for fine-tuning the performance of a machine learning model by varying its 
hyperparameter values, as we saw in the previous subsection. If we want to select 
among different machine learning algorithms, though, another recommended 
approach is nested cross-validation. In a nice study on the bias in error estimation,
Varma and Simon concluded that the true error of the estimate is almost unbiased
relative to the test set when nested cross-validation is used (Bias in Error Estimation
When Using Cross-validation for Model Selection, BMC Bioinformatics, S. Varma 
and R. Simon, 7(1): 91, 2006). 


In nested cross-validation, we have an outer k-fold cross-validation loop to split the
data into training and test folds, and an inner loop is used to select the model using
k-fold cross-validation on the training fold. After model selection, the test fold is then
used to evaluate the model performance. The following figure explains the concept
of nested cross-validation with only five outer and two inner folds, which can be
useful for large datasets where computational performance is important; this
particular type of nested cross-validation is also known as 5x2 cross-validation:


In scikit-learn, we can perform nested cross-validation as follows:


>>> gs = GridSearchCV(estimator=pipe_svc,
...                   param_grid=param_grid,
...                   scoring='accuracy',
...                   cv=2)
>>> scores = cross_val_score(gs, X_train, y_train,
...                          scoring='accuracy', cv=5)
>>> print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores),
...                                       np.std(scores)))
CV accuracy: 0.974 +/- 0.015


The returned average cross-validation accuracy gives us a good estimate of what to
expect if we tune the hyperparameters of a model and use it on unseen data. For 
example, we can use the nested cross-validation approach to compare an SVM 
model to a simple decision tree classifier; for simplicity, we will only tune its depth 
parameter: 


>>> from sklearn.tree import DecisionTreeClassifier
>>> gs = GridSearchCV(estimator=DecisionTreeClassifier(
...                       random_state=0),
...                   param_grid=[{'max_depth': [1, 2, 3,
...                                              4, 5, 6, 7, None]}],
...                   scoring='accuracy',
...                   cv=2)
>>> scores = cross_val_score(gs, X_train, y_train,
...                          scoring='accuracy', cv=5)
>>> print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores),
...                                       np.std(scores)))
CV accuracy: 0.934 +/- 0.016


As we can see, the nested cross-validation performance of the SVM model (97.4 
percent) is notably better than the performance of the decision tree (93.4 percent),
and thus, we'd expect that it might be the better choice to classify new data that 
comes from the same population as this particular dataset. 
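

The control flow of nested cross-validation can be sketched without any library. The following is a simplified illustration only: the helper k_fold_indices is hypothetical, it neither shuffles nor stratifies the data, and the 20 sample indices are a stand-in for a real dataset. It shows that the outer test fold is never touched by the inner model selection loop:

```python
def k_fold_indices(indices, k):
    """Split a list of indices into k (train, test) pairs without shuffling."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


samples = list(range(20))   # stand-in for 20 sample indices (hypothetical)
n_inner_fits = 0

for outer_train, outer_test in k_fold_indices(samples, k=5):
    # Inner loop: used only for model selection on the outer training fold.
    for inner_train, inner_valid in k_fold_indices(outer_train, k=2):
        # The outer test fold never leaks into the inner loop.
        assert not set(outer_test) & set(inner_train + inner_valid)
        n_inner_fits += 1
    # Here, the selected model would be refit on outer_train
    # and evaluated once on outer_test.

print(n_inner_fits)   # 5 outer folds x 2 inner folds
```

With a hyperparameter grid, each of these inner fits is repeated once per candidate setting, so the computational cost of nested cross-validation grows with both loop sizes and the size of the grid.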

Looking at different performance evaluation metrics


In the previous sections and chapters, we evaluated our models using model 
accuracy, which is a useful metric with which to quantify the performance of a
model in general. However, there are several other performance metrics that can be
used to measure a model's relevance, such as precision, recall, and the F1-score.

Reading a confusion matrix


Before we get into the details of different scoring metrics, let's take a look at a 
confusion matrix, a matrix that lays out the performance of a learning algorithm. 
The confusion matrix is simply a square matrix that reports the counts of the True
positive (TP), True negative (TN), False positive (FP), and False negative (FN) 
predictions of a classifier, as shown in the following figure: 


                          Predicted class
                          P                       N
Actual class    P         True positives (TP)     False negatives (FN)
                N         False positives (FP)    True negatives (TN)


Although these metrics can be easily computed manually by comparing the true and
predicted class labels, scikit-learn provides a convenient confusion_matrix function
that we can use, as follows:


>>> from sklearn.metrics import confusion_matrix
>>> pipe_svc.fit(X_train, y_train)
>>> y_pred = pipe_svc.predict(X_test)
>>> confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
>>> print(confmat)
[[71  1]
 [ 2 40]]


The array that was returned after executing the code provides us with information 
about the different types of error the classifier made on the test dataset. We can map 
this information onto the confusion matrix illustration in the previous figure using 
Matplotlib's matshow function:


>>> fig, ax = plt.subplots(figsize=(2.5, 2.5))
>>> ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
>>> for i in range(confmat.shape[0]):
...     for j in range(confmat.shape[1]):
...         ax.text(x=j, y=i,
...                 s=confmat[i, j],
...                 va='center', ha='center')
>>> plt.xlabel('predicted label')
>>> plt.ylabel('true label')
>>> plt.show()


Now, the following confusion matrix plot, with the added labels, should make the 
results a little bit easier to interpret: 


[Figure: confusion matrix plot (x-axis: predicted label, y-axis: true label)]


Assuming that class 1 (malignant) is the positive class in this example, our model
correctly classified 71 of the samples that belong to class 0 (TNs) and 40 samples
that belong to class 1 (TPs), respectively. However, our model also
misclassified two samples from class 1 as class 0 (FN), and it predicted that one
sample is malignant although it is a benign tumor (FP). In the next section, we will
learn how we can use this information to calculate various error metrics.
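

Counting the four cells by hand is straightforward. The following sketch shows one way to do it in plain Python; the function name confusion_counts and the toy label lists are hypothetical and unrelated to the breast cancer data:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for binary class labels."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # positive sample, predicted positive
            else:
                fn += 1   # positive sample, predicted negative
        else:
            if predicted == positive:
                fp += 1   # negative sample, predicted positive
            else:
                tn += 1   # negative sample, predicted negative
    return tp, fn, fp, tn


# Toy labels (hypothetical values for illustration)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)
```

Note that the array printed by scikit-learn places the counts for class 0 in the first row, so with class 1 as the positive class it reads [[TN, FP], [FN, TP]].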

Optimizing the precision and recall of a classification model


Both the prediction error (ERR) and accuracy (ACC) provide general information 
about how many samples are misclassified. The error can be understood as the sum 
of all false predictions divided by the number of total predictions, and the accuracy
is calculated as the sum of correct predictions divided by the total number of 
predictions, respectively: 


$$ERR = \frac{FP + FN}{FP + FN + TP + TN}$$


The prediction accuracy can then be calculated directly from the error: 


$$ACC = \frac{TP + TN}{FP + FN + TP + TN} = 1 - ERR$$


The True positive rate (TPR) and False positive rate (FPR) are performance 
metrics that are especially useful for imbalanced class problems: 





$$FPR = \frac{FP}{N} = \frac{FP}{FP + TN}$$

$$TPR = \frac{TP}{P} = \frac{TP}{FN + TP}$$


In tumor diagnosis, for example, we are more concerned about the detection of
malignant tumors in order to help a patient with the appropriate treatment. However,
it is also important to decrease the number of benign tumors that were incorrectly
classified as malignant (FPs) to not unnecessarily concern a patient. In contrast to the
FPR, the TPR provides useful information about the fraction of positive (or relevant)
samples that were correctly identified out of the total pool of positives (P).


The performance metrics precision (PRE) and recall (REC) are related to those true 
positive and negative rates, and in fact, REC is synonymous with TPR: 


$$PRE = \frac{TP}{TP + FP}$$

$$REC = TPR = \frac{TP}{P} = \frac{TP}{FN + TP}$$


In practice, often a combination of PRE and REC is used, the so-called F1-score:

$$F1 = 2 \frac{PRE \times REC}{PRE + REC}$$


Those scoring metrics are all implemented in scikit-learn and can be imported from 
the sklearn.metrics module as shown in the following snippet: 


>>> from sklearn.metrics import precision_score
>>> from sklearn.metrics import recall_score, f1_score
>>> print('Precision: %.3f' % precision_score(
...           y_true=y_test, y_pred=y_pred))
Precision: 0.976
>>> print('Recall: %.3f' % recall_score(
...           y_true=y_test, y_pred=y_pred))
Recall: 0.952
>>> print('F1: %.3f' % f1_score(
...           y_true=y_test, y_pred=y_pred))
F1: 0.964
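

As a cross-check, the metrics defined in this section can be computed directly from the four confusion matrix counts reported earlier for the test set (TP = 40, FN = 2, FP = 1, TN = 71). The following sketch is plain arithmetic and makes no assumptions beyond those counts:

```python
tp, fn, fp, tn = 40, 2, 1, 71   # counts from the confusion matrix of the test set

total = tp + fn + fp + tn
err = (fp + fn) / total          # prediction error
acc = (tp + tn) / total          # accuracy, equivalently 1 - err
fpr = fp / (fp + tn)             # false positive rate
tpr = tp / (fn + tp)             # true positive rate
pre = tp / (tp + fp)             # precision
rec = tpr                        # recall is synonymous with the TPR
f1 = 2 * (pre * rec) / (pre + rec)

print('ERR: %.3f, ACC: %.3f' % (err, acc))
print('FPR: %.3f, TPR: %.3f' % (fpr, tpr))
print('PRE: %.3f, REC: %.3f, F1: %.3f' % (pre, rec, f1))
```

The precision, recall, and F1 values obtained this way round to 0.976, 0.952, and 0.964, which agrees with the scores returned by scikit-learn.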


Furthermore, we can use a different scoring metric than accuracy in the
GridSearchCV via the scoring parameter. A complete list of the different values that
are accepted by the scoring parameter can be found at
http://scikit-learn.org/stable/modules/model_evaluation.html.


Remember that the positive class in scikit-learn is the class that is labeled as class 1.
If we want to specify a different positive label, we can construct our own scorer via
the make_scorer function, which we can then directly provide as an argument to the
scoring parameter in GridSearchCV (in this example, using the f1_score as a
metric):


>>> from sklearn.metrics import make_scorer, f1_score
>>> scorer = make_scorer(f1_score, pos_label=0)
>>> gs = GridSearchCV(estimator=pipe_svc,
...                   param_grid=param_grid,
...                   scoring=scorer,
...                   cv=10)
>>> gs = gs.fit(X_train, y_train)
>>> print(gs.best_score_)
0.986202145696
>>> print(gs.best_params_)
{'svc__C': 10.0, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}

Plotting a receiver operating characteristic


Receiver Operating Characteristic (ROC) graphs are useful tools to select models 
for classification based on their performance with respect to the FPR and TPR, 
which are computed by shifting the decision threshold of the classifier. The diagonal 
of an ROC graph can be interpreted as random guessing, and classification models 
that fall below the diagonal are considered as worse than random guessing. A perfect 
classifier would fall into the top left corner of the graph with a TPR of 1 and an FPR 
of 0. Based on the ROC curve, we can then compute the so-called ROC Area Under 
the Curve (ROC AUC) to characterize the performance of a classification model. 
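

To see how shifting the decision threshold produces the points of an ROC curve, consider the following library-free sketch. The helper names roc_points and trapezoid_auc, as well as the toy labels and scores, are hypothetical; the sketch also assumes that all scores are distinct, so it does not handle ties the way a production implementation would:

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the decision threshold.

    Simplification: assumes all scores are distinct (no tie handling).
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Visit the samples from the highest to the lowest score.
    ranked = sorted(zip(scores, y_true), reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve via the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Toy labels and predicted probabilities (hypothetical values)
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
fpr, tpr = roc_points(y_true, scores)
print(trapezoid_auc(fpr, tpr))
```

In this toy example, eight of the nine possible (positive, negative) pairs are ranked correctly, so the area under the curve is 8/9; a classifier that ranks every positive sample above every negative one would reach an area of 1.0.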


Similar to ROC curves, we can compute precision-recall curves for different 
probability thresholds of a classifier. A function for plotting those precision-recall
curves is also implemented in scikit-learn and is documented at
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html.


Executing the following code example, we will plot an ROC curve of a classifier that 
only uses two features from the Breast Cancer Wisconsin dataset to predict whether 
a tumor is benign or malignant. Although we are going to use the same logistic 
regression pipeline that we defined previously, we are making the classification task 
more challenging for the classifier so that the resulting ROC curve becomes visually 
more interesting. For similar reasons, we are also reducing the number of folds in the 
StratifiedKFold validator to three. The code is as follows: 


>>> from sklearn.metrics import roc_curve, auc
>>> from scipy import interp
>>> pipe_lr = make_pipeline(StandardScaler(),
...                         PCA(n_components=2),
...                         LogisticRegression(penalty='l2',
...                                            random_state=1,
...                                            C=100.0))
>>> X_train2 = X_train[:, [4, 14]]
>>> cv = list(StratifiedKFold(n_splits=3,
...                           random_state=1).split(X_train,
...                                                 y_train))
>>> fig = plt.figure(figsize=(7, 5))
>>> mean_tpr = 0.0
>>> mean_fpr = np.linspace(0, 1, 100)
>>> all_tpr = []
>>> for i, (train, test) in enumerate(cv):
...     probas = pipe_lr.fit(X_train2[train],
...                  y_train[train]).predict_proba(X_train2[test])
...     fpr, tpr, thresholds = roc_curve(y_train[test],
...                                      probas[:, 1],
...                                      pos_label=1)
...     mean_tpr += interp(mean_fpr, fpr, tpr)
...     mean_tpr[0] = 0.0
...     roc_auc = auc(fpr, tpr)
...     plt.plot(fpr,
...              tpr,
...              label='ROC fold %d (area = %0.2f)'
...                    % (i+1, roc_auc))
>>> plt.plot([0, 1],
...          [0, 1],
...          linestyle='--',
...          color=(0.6, 0.6, 0.6),
...          label='random guessing')
>>> mean_tpr /= len(cv)
>>> mean_tpr[-1] = 1.0
>>> mean_auc = auc(mean_fpr, mean_tpr)
>>> plt.plot(mean_fpr, mean_tpr, 'k--',
...          label='mean ROC (area = %0.2f)' % mean_auc, lw=2)
>>> plt.plot([0, 0, 1],
...          [0, 1, 1],
...          linestyle=':',
...          color='black',
...          label='perfect performance')
>>> plt.xlim([-0.05, 1.05])
>>> plt.ylim([-0.05, 1.05])
>>> plt.xlabel('false positive rate')
>>> plt.ylabel('true positive rate')
>>> plt.legend(loc="lower right")
>>> plt.show()


In the preceding code example, we used the already familiar StratifiedKFold class
from scikit-learn and calculated the ROC performance of the LogisticRegression
classifier in our pipe_lr pipeline using the roc_curve function from the
sklearn.metrics module separately for each iteration. Furthermore, we interpolated
the average ROC curve from the three folds via the interp function that we
imported from SciPy and calculated the area under the curve via the auc function.


The resulting ROC curve indicates that there is a certain degree of variance between
the different folds, and the average ROC AUC (0.76) falls between a perfect score
(1.0) and random guessing (0.5):


[Figure: ROC curves (x-axis: false positive rate, y-axis: true positive rate) with the
legend entries ROC fold 1 (area = 0.73), ROC fold 2 (area = 0.76), ROC fold 3
(area = 0.79), random guessing, mean ROC (area = 0.76), and perfect performance]


Note that if we are just interested in the ROC AUC score, we could also directly import
the roc_auc_score function from the sklearn.metrics submodule.


Reporting the performance of a classifier as the ROC AUC can yield further insights
into a classifier's performance with respect to imbalanced samples. However, while the
accuracy score can be interpreted as a single cut-off point on an ROC curve, A. P. 
Bradley showed that the ROC AUC and accuracy metrics mostly agree with each 
other: The use of the area under the roc curve in the evaluation of machine learning 
algorithms, A. P. Bradley, Pattern Recognition, 30(7): 1145-1159, 1997. 

Scoring metrics for multiclass classification


The scoring metrics that we discussed in this section are specific to binary 
classification systems. However, scikit-learn also implements macro and micro 
averaging methods to extend those scoring metrics to multiclass problems via One- 
versus-All (OvA) classification. The micro-average is calculated from the individual 
TPs, TNs, FPs, and FNs of the system. For example, the micro-average of the 
precision score in a k-class system can be calculated as follows: 


$$PRE_{micro} = \frac{TP_1 + \cdots + TP_k}{TP_1 + \cdots + TP_k + FP_1 + \cdots + FP_k}$$


The macro-average is simply calculated as the average scores of the different
systems:


$$PRE_{macro} = \frac{PRE_1 + \cdots + PRE_k}{k}$$


Micro-averaging is useful if we want to weight each instance or prediction equally,
whereas macro-averaging weights all classes equally to evaluate the overall 
performance of a classifier with regard to the most frequent class labels. 


If we are using binary performance metrics to evaluate multiclass classification 
models in scikit-learn, a normalized or weighted variant of the macro-average is used 
by default. The weighted macro-average is calculated by weighting the score of each 
class label by the number of true instances when calculating the average. The 
weighted macro-average is useful if we are dealing with class imbalances, that is,
different numbers of instances for each label. 
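

The difference between the three averaging schemes becomes clearer with a small numeric example. The per-class counts below are hypothetical values for a three-class problem with 100 samples, chosen only to illustrate the formulas; they are not results from any dataset used in this chapter:

```python
# Hypothetical per-class counts for a 3-class problem (one-versus-all view)
tp = [50, 10, 5]         # true positives per class
fp = [10, 5, 20]         # false positives per class
support = [60, 25, 15]   # number of true instances per class

per_class_pre = [t / (t + f) for t, f in zip(tp, fp)]

# Micro-average: pool the counts of all classes first
pre_micro = sum(tp) / (sum(tp) + sum(fp))

# Macro-average: average the per-class scores, every class counts equally
pre_macro = sum(per_class_pre) / len(per_class_pre)

# Weighted macro-average: weight each class score by its number of true instances
pre_weighted = (sum(p * s for p, s in zip(per_class_pre, support))
                / sum(support))

print('micro: %.3f' % pre_micro)
print('macro: %.3f' % pre_macro)
print('weighted: %.3f' % pre_weighted)
```

Because the smallest class has a poor precision in this toy example, the macro-average is noticeably lower than the micro-average, while the weighted variant falls in between.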


While the weighted macro-average is the default for multiclass problems in
scikit-learn, we can specify the averaging method via the average parameter inside the
different scoring functions that we import from the sklearn.metrics module, for
example, the precision_score or make_scorer functions:


>>> pre_scorer = make_scorer(score_func=precision_score,
...                          pos_label=1,
...                          greater_is_better=True,
...                          average='micro')

Dealing with class imbalance


We've mentioned class imbalances several times throughout this chapter, and yet we 
haven't actually discussed how to deal with such scenarios appropriately if they 
occur. Class imbalance is a quite common problem when working with real-world 
data—samples from one class or multiple classes are over-represented in a dataset. 
Intuitively, we can think of several domains where this may occur, such as spam 
filtering, fraud detection, or screening for diseases. 


Imagine the breast cancer dataset that we've been working with in this chapter 
consisted of 90 percent healthy patients. In this case, we could achieve 90 percent 
accuracy on the test dataset by just predicting the majority class (benign tumor) for 
all samples, without the help of a supervised machine learning algorithm. Thus, 
training a model on such a dataset that achieves approximately 90 percent test 
accuracy would mean our model hasn't learned anything useful from the features 
provided in this dataset. 


In this section, we will briefly go over some of the techniques that could help with 
imbalanced datasets. But before we discuss different methods to approach this 
problem, let's create an imbalanced dataset from our breast cancer dataset, which 
originally consisted of 357 benign tumors (class 0) and 212 malignant tumors (class 1):


>>> X_imb = np.vstack((X[y == 0], X[y == 1][:40]))
>>> y_imb = np.hstack((y[y == 0], y[y == 1][:40]))


In the previous code snippet, we took all 357 benign tumor samples and stacked 
them with the first 40 malignant samples to create a stark class imbalance. If we 
were to compute the accuracy of a model that always predicts the majority class 
(benign, class 0), we would achieve a prediction accuracy of approximately 90 
percent: 


>>> y_pred = np.zeros(y_imb.shape[0])
>>> np.mean(y_pred == y_imb) * 100
89.924433249370267 


Thus, when we fit classifiers on such datasets, it would make sense to focus on other
metrics than accuracy when comparing different models, such as precision, recall,
or the ROC curve, depending on what we care most about in our application. For
instance, if our priority is to identify the majority of patients with malignant cancer
in order to recommend an additional screening, then recall should be our metric of
choice. In spam filtering, where we don't want to label emails as spam if the system
is not very certain, precision might be a more appropriate metric.
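To make the point about accuracy concrete, the following minimal sketch scores the always-benign baseline with recall and precision for the malignant class. It uses synthetic labels with the same class counts (357 and 40) rather than the actual dataset, and the helper name recall_precision is hypothetical:

```python
import numpy as np

# Synthetic labels with the same class counts as the imbalanced dataset
# (357 benign = 0, 40 malignant = 1); an assumption for illustration.
y_true = np.hstack((np.zeros(357, dtype=int), np.ones(40, dtype=int)))
y_pred = np.zeros_like(y_true)  # baseline: always predict the majority class


def recall_precision(y_true, y_pred, pos_label=1):
    """Return (recall, precision) for pos_label.

    Both default to 0.0 when their denominator is zero.
    """
    tp = np.sum((y_pred == pos_label) & (y_true == pos_label))
    fn = np.sum((y_pred != pos_label) & (y_true == pos_label))
    fp = np.sum((y_pred == pos_label) & (y_true != pos_label))
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return float(recall), float(precision)


accuracy = float(np.mean(y_true == y_pred))
recall, precision = recall_precision(y_true, y_pred)
print('Accuracy: %.3f, recall: %.3f, precision: %.3f'
      % (accuracy, recall, precision))
```

Although the baseline is right for roughly 90 percent of the samples, it never finds a single malignant case, which is exactly what recall exposes.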


Aside from evaluating machine learning models, class imbalance influences a 
learning algorithm during model fitting itself. Since machine learning algorithms 
typically optimize a reward or cost function that is computed as a sum over the 
training examples that it sees during fitting, the decision rule is likely going to be 
biased towards the majority class. In other words, the algorithm implicitly learns a 
model that optimizes the predictions based on the most abundant class in the dataset, 
in order to minimize the cost or maximize the reward during training. 


One way to deal with imbalanced class proportions during model fitting is to assign a
larger penalty to wrong predictions on the minority class. Via scikit-learn, adjusting
such a penalty is as convenient as setting the class_weight parameter to
class_weight='balanced', which is implemented for most classifiers.
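As a rough illustration of what the 'balanced' setting does, the sketch below computes the per-class weights with the heuristic given in the scikit-learn documentation, n_samples / (n_classes * np.bincount(y)). The labels are synthetic stand-ins with the class counts used in this section, not the actual dataset:

```python
import numpy as np

# Synthetic labels with the class counts used in this section (assumption).
y_imb = np.hstack((np.zeros(357, dtype=int), np.ones(40, dtype=int)))

n_samples = y_imb.shape[0]
n_classes = np.unique(y_imb).shape[0]

# Heuristic documented by scikit-learn for class_weight='balanced':
# weights are inversely proportional to the class frequencies.
balanced_weights = n_samples / (n_classes * np.bincount(y_imb))
print(balanced_weights)
```

The minority class receives a weight several times larger than the majority class, so a misclassified minority sample contributes correspondingly more to the cost.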


Other popular strategies for dealing with class imbalance include upsampling the 
minority class, downsampling the majority class, and the generation of synthetic 
training samples. Unfortunately, there's no universally best solution, no technique 
that works best across different problem domains. Thus, in practice, it is
recommended to try out different strategies on a given problem, evaluate the results, 
and choose the technique that seems most appropriate. 


The scikit-learn library implements a simple resample function that can help with 
the upsampling of the minority class by drawing new samples from the dataset with 
replacement. The following code will take the minority class from our imbalanced 
breast cancer dataset (here, class 1) and repeatedly draw new samples from it until it 
contains the same number of samples as class label 0:


>>> from sklearn.utils import resample
>>> print('Number of class 1 samples before:',
...       X_imb[y_imb == 1].shape[0])
Number of class 1 samples before: 40
>>> X_upsampled, y_upsampled = resample(X_imb[y_imb == 1],
...                                     y_imb[y_imb == 1],
...                                     replace=True,
...                                     n_samples=X_imb[y_imb == 0].shape[0],
...                                     random_state=123)
>>> print('Number of class 1 samples after:',
...       X_upsampled.shape[0])
Number of class 1 samples after: 357


After resampling, we can then stack the original class 0 samples with the upsampled 
class 1 subset to obtain a balanced dataset as follows: 


>>> X_bal = np.vstack((X[y == 0], X_upsampled))
>>> y_bal = np.hstack((y[y == 0], y_upsampled))


Consequently, a majority vote prediction rule would only achieve 50 percent 
accuracy: 


>>> y_pred = np.zeros(y_bal.shape[0])
>>> np.mean(y_pred == y_bal) * 100
50.0


Similarly, we could downsample the majority class by removing training examples 
from the dataset. To perform downsampling using the resample function, we could 
simply swap the class 1 label with class 0 in the previous code example and vice 
versa. 
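A minimal sketch of that swap is shown below. It runs on a synthetic stand-in for X_imb and y_imb (an assumption, so the block is self-contained), and it sets replace=False, which is a choice made here so that no majority sample is drawn twice:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(1)
# Synthetic stand-in for X_imb, y_imb (assumption):
# 357 class 0 rows followed by 40 class 1 rows, two features each.
X_imb = rng.normal(size=(397, 2))
y_imb = np.hstack((np.zeros(357, dtype=int), np.ones(40, dtype=int)))

# Draw as many majority-class samples as there are minority-class samples
X_downsampled, y_downsampled = resample(
    X_imb[y_imb == 0],
    y_imb[y_imb == 0],
    replace=False,
    n_samples=X_imb[y_imb == 1].shape[0],
    random_state=123)

# Stack the reduced class 0 subset with the untouched class 1 samples
X_bal = np.vstack((X_downsampled, X_imb[y_imb == 1]))
y_bal = np.hstack((y_downsampled, y_imb[y_imb == 1]))
print(np.bincount(y_bal))
```

Downsampling discards most of the majority class, so it trades information for balance; whether that trade is acceptable depends on how much data is available.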


Note 


Another technique for dealing with class imbalance is the generation of synthetic
training samples, which is beyond the scope of this book. Probably the most widely
used algorithm for synthetic training sample generation is Synthetic Minority
Over-sampling Technique (SMOTE), and you can learn more about this technique in the
original research article by Nitesh Chawla and others: SMOTE: Synthetic Minority
Over-sampling Technique, Journal of Artificial Intelligence Research, 16: 321-357,
2002. It is also highly recommended to check out imbalanced-learn, a Python
library that is entirely focused on imbalanced datasets, including an implementation
of SMOTE. You can learn more about imbalanced-learn at
https://github.com/scikit-learn-contrib/imbalanced-learn.


Summary


At the beginning of this chapter, we discussed how to chain different transformation 
techniques and classifiers in convenient model pipelines that helped us train and 
evaluate machine learning models more efficiently. We then used those pipelines to 
perform k-fold cross-validation, one of the essential techniques for model selection 
and evaluation. Using k-fold cross-validation, we plotted learning and validation 
curves to diagnose the common problems of learning algorithms, such as overfitting 
and underfitting. Using grid search, we further fine-tuned our model. We concluded 
this chapter by looking at a confusion matrix and various performance metrics that 
can be useful to further optimize a model's performance for a specific problem task. 
Now, we should be well-equipped with the essential techniques to build supervised 
machine learning models for classification successfully. 


In the next chapter, we will look at ensemble methods: methods that allow us to 
combine multiple models and classification algorithms to boost the predictive 
performance of a machine learning system even further.


Chapter 7. Combining Different Models for Ensemble Learning


In the previous chapter, we focused on the best practices for tuning and evaluating 
different models for classification. In this chapter, we will build upon these 
techniques and explore different methods for constructing a set of classifiers that can 
often have a better predictive performance than any of its individual members. We 
will learn how to do the following: 


• Make predictions based on majority voting

• Use bagging to reduce overfitting by drawing random combinations of the
  training set with repetition

• Apply boosting to build powerful models from weak learners that learn from
  their mistakes


Learning with ensembles


The goal of ensemble methods is to combine different classifiers into a meta- 
classifier that has better generalization performance than each individual classifier 
alone. For example, assuming that we collected predictions from 10 experts, 
ensemble methods would allow us to strategically combine these predictions by the 
10 experts to come up with a prediction that is more accurate and robust than the 
predictions by each individual expert. As we will see later in this chapter, there are 
several different approaches for creating an ensemble of classifiers. In this section, 
we will introduce a basic perception of how ensembles work and why they are 
typically recognized for yielding a good generalization performance. 


In this chapter, we will focus on the most popular ensemble methods that use the 
majority voting principle. Majority voting simply means that we select the class 
label that has been predicted by the majority of classifiers, that is, received more than
50 percent of the votes. Strictly speaking, the term majority vote refers to binary 
class settings only. However, it is easy to generalize the majority voting principle to 
multi-class settings, which is called plurality voting. Here, we select the class label 
that received the most votes (mode). The following diagram illustrates the concept of 
majority and plurality voting for an ensemble of 10 classifiers where each unique 
symbol (triangle, square, and circle) represents a unique class label: 


[Figure: unanimity, majority, and plurality voting for an ensemble of 10 classifiers]
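The distinction between majority and plurality voting can be sketched in a few lines. The function name vote and the example predictions are hypothetical and only serve as an illustration:

```python
from collections import Counter


def vote(predictions):
    """Return (label, kind) for a list of predicted class labels.

    kind is 'majority' if the winning label received more than half of
    the votes and 'plurality' otherwise.
    """
    label, count = Counter(predictions).most_common(1)[0]
    kind = 'majority' if count > len(predictions) / 2 else 'plurality'
    return label, kind


# 10 classifiers, two classes: the winner has more than 50 percent
print(vote(['circle'] * 6 + ['triangle'] * 4))

# 10 classifiers, three classes: the winner has the most votes,
# but fewer than 50 percent
print(vote(['circle'] * 4 + ['triangle'] * 3 + ['square'] * 3))
```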





Using the training set, we start by training m different classifiers ($C_1, \dots, C_m$).
Depending on the technique, the ensemble can be built from different classification
algorithms, for example, decision trees, support vector machines, logistic regression
classifiers, and so on. Alternatively, we can also use the same base classification 
algorithm, fitting different subsets of the training set. One prominent example of this 
approach is the random forest algorithm, which combines different decision tree 
classifiers. The following figure illustrates the concept of a general ensemble 
approach using majority voting: 


[Figure: general ensemble approach. Classification models, their predictions, and the
final prediction obtained by voting]





To predict a class label via simple majority or plurality voting, we combine the
predicted class labels of each individual classifier, $C_j$, and select the class label,
$\hat{y}$, that received the most votes:


$$\hat{y} = \text{mode}\{C_1(x), C_2(x), \dots, C_m(x)\}$$


For example, in a binary classification task where class1 = -1 and class2 = +1, we
can write the majority vote prediction as follows:


$$C(x) = \text{sign}\left[\sum_{j}^{m} C_j(x)\right] =
\begin{cases} 1 & \text{if } \sum_{j} C_j(x) \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
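The binary rule above can be sketched as follows. The function name majority_vote_sign is hypothetical; the explicit comparison is used instead of np.sign because np.sign returns 0 for a tie, whereas the rule assigns ties to the class +1:

```python
import numpy as np


def majority_vote_sign(votes):
    """Combine class labels in {-1, +1}; a tie (sum == 0) is assigned to +1."""
    return 1 if np.sum(votes) >= 0 else -1


print(majority_vote_sign([1, -1, 1]))    # two of three vote +1
print(majority_vote_sign([-1, -1, 1]))   # two of three vote -1
print(majority_vote_sign([1, -1]))       # tie
```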




















To illustrate why ensemble methods can work better than individual classifiers alone,
let's apply the simple concepts of combinatorics. For the following example, we
make the assumption that all n base classifiers for a binary classification task have
an equal error rate, $\varepsilon$. Furthermore, we assume that the classifiers are
independent and the error rates are not correlated. Under those assumptions, we can
simply express the error probability of an ensemble of base classifiers as a probability
mass function of a binomial distribution:


$$P(y \ge k) = \sum_{k}^{n} \binom{n}{k} \varepsilon^{k} (1 - \varepsilon)^{n-k} = \varepsilon_{ensemble}$$


Here, $\binom{n}{k}$ is the binomial coefficient n choose k. In other words, we compute
the probability that the prediction of the ensemble is wrong. Now let's take a look at a
more concrete example of 11 base classifiers ($n = 11$) where each classifier has an
error rate of 0.25 ($\varepsilon = 0.25$):


$$P(y \ge k) = \sum_{k=6}^{11} \binom{11}{k} 0.25^{k} (1 - 0.25)^{11-k} = 0.034$$


Note 


The binomial coefficient 


The binomial coefficient refers to the number of ways we can choose subsets of k 
unordered elements from a set of size n; thus, it is often called "n choose k." Since 
the order does not matter here, the binomial coefficient is also sometimes referred to 
as combination or combinatorial number, and in its unabbreviated form, it is written
as follows:


$$\frac{n!}{(n-k)!\,k!}$$


Here, the symbol (!) stands for factorial; for example, $3! = 3 \times 2 \times 1 = 6$.


As we can see, the error rate of the ensemble (0.034) is much lower than the error
rate of each individual classifier (0.25) if all the assumptions are met. Note that, in
this simplified illustration, a 50-50 split by an even number of classifiers n is treated
as an error, whereas this is only true half of the time. To compare such an idealistic
ensemble classifier to a base classifier over a range of different base error rates, let's
implement the probability mass function in Python:


>>> # comb lives in scipy.special; the older scipy.misc.comb is no longer
>>> # available in recent SciPy releases
>>> from scipy.special import comb
>>> import math
>>> def ensemble_error(n_classifier, error):
...     k_start = int(math.ceil(n_classifier / 2.))
...     probs = [comb(n_classifier, k) *
...              error**k *
...              (1-error)**(n_classifier - k)
...              for k in range(k_start, n_classifier + 1)]
...     return sum(probs)
>>> ensemble_error(n_classifier=11, error=0.25)
0.034327507019042969


After we have implemented the ensemble error function, we can compute the 
ensemble error rates for a range of different base errors from 0.0 to 1.0 to visualize 
the relationship between ensemble and base errors in a line graph:


>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> error_range = np.arange(0.0, 1.01, 0.01)
>>> ens_errors = [ensemble_error(n_classifier=11, error=error)
...               for error in error_range]
>>> plt.plot(error_range, ens_errors,
...          label='Ensemble error',
...          linewidth=2)
>>> plt.plot(error_range, error_range,
...          linestyle='--', label='Base error',
...          linewidth=2)
>>> plt.xlabel('Base error')
>>> plt.ylabel('Base/Ensemble error')
>>> plt.legend(loc='upper left')
>>> plt.grid(alpha=0.5)
>>> plt.show()


As we can see in the resulting plot, the error probability of an ensemble is always
better than the error of an individual base classifier, as long as the base classifiers
perform better than random guessing ($\varepsilon < 0.5$). Note that the y-axis depicts
the base error (dotted line) as well as the ensemble error (continuous line).
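The crossover at a base error of 0.5 can also be checked numerically. The following sketch re-implements the ensemble error with the standard library only (math.comb requires Python 3.8 or newer) and evaluates it below, at, and above 0.5; the chosen error rates are illustrative:

```python
from math import ceil, comb


def ensemble_error(n_classifier, error):
    """Probability that at least half of n independent classifiers are wrong."""
    k_start = ceil(n_classifier / 2)
    return sum(comb(n_classifier, k)
               * error**k
               * (1 - error)**(n_classifier - k)
               for k in range(k_start, n_classifier + 1))


for eps in (0.25, 0.5, 0.6):
    ens = ensemble_error(n_classifier=11, error=eps)
    print('base error %.2f -> ensemble error %.3f' % (eps, ens))
```

Below 0.5 the ensemble error is smaller than the base error, at 0.5 the two coincide, and above 0.5 the ensemble is worse than its members, which is why the base classifiers must beat random guessing.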
Combining classifiers via majority vote


After the short introduction to ensemble learning in the previous section, let's start
with a warm-up exercise and implement a simple ensemble classifier for majority
voting in Python.


Note


Although the majority voting algorithm that we will discuss in this section also
generalizes to multi-class settings via plurality voting, we will use the term majority
voting for simplicity, as it is also often done in the literature.


Implementing a simple majority vote classifier


The algorithm that we are going to implement in this section will allow us to
combine different classification algorithms associated with individual weights for
confidence. Our goal is to build a stronger meta-classifier that balances out the
individual classifiers' weaknesses on a particular dataset. In more precise
mathematical terms, we can write the weighted majority vote as follows:


$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j \, \chi_A\left(C_j(x) = i\right)$$


Here, $w_j$ is a weight associated with a base classifier, $C_j$, $\hat{y}$ is the
predicted class label of the ensemble, $\chi_A$ (Greek chi) is the characteristic function
$[C_j(x) = i \in A]$, and A is the set of unique class labels. For equal weights, we can
simplify this equation and write it as follows:


$$\hat{y} = \text{mode}\{C_1(x), C_2(x), \dots, C_m(x)\}$$


Note 


In statistics, the mode is the most frequent event or result in a set. For example,
mode{1,2,1,1,2,4,5,4} = 1.


To better understand the concept of weighting, we will now take a look at a more
concrete example. Let us assume that we have an ensemble of three base classifiers,
$C_j$ ($j \in \{1, 2, 3\}$), and want to predict the class label of a given sample
instance, x. Two out of three base classifiers predict the class label 0, and one, $C_3$,
predicts that the sample belongs to class 1. If we weight the predictions of each base
classifier equally, the majority vote would predict that the sample belongs to class 0:


$$C_1(x) \to 0, \quad C_2(x) \to 0, \quad C_3(x) \to 1$$

$$\hat{y} = \text{mode}\{0, 0, 1\} = 0$$


Now, let us assign a weight of 0.6 to $C_3$ and weight $C_1$ and $C_2$ by a coefficient
of 0.2:


$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j \, \chi_A\left(C_j(x) = i\right)
= \arg\max_{i} \left[0.2 \times i_0 + 0.2 \times i_0 + 0.6 \times i_1\right] = 1$$


More intuitively, since 3 x 0.2 = 0.6, we can say that the prediction made by $C_3$ has
three times more weight than the predictions by $C_1$ or $C_2$, which we can write as
follows:


$$\hat{y} = \text{mode}\{0, 0, 1, 1, 1\} = 1$$


To translate the concept of the weighted majority vote into Python code, we can use 
NumPy's convenient argmax and bincount functions: 


>>> import numpy as np
>>> np.argmax(np.bincount([0, 0, 1],
...           weights=[0.2, 0.2, 0.6]))
1
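To see what happens inside that one-liner, the sketch below splits it into two steps. With the weights argument, np.bincount sums the weights of all votes cast for each class label instead of counting the votes, and np.argmax then picks the label with the largest total; the variable names are chosen here for illustration:

```python
import numpy as np

labels = [0, 0, 1]           # class labels predicted by C_1, C_2, C_3
weights = [0.2, 0.2, 0.6]    # weight assigned to each classifier

# Total weight collected by class 0 and by class 1
weighted_votes = np.bincount(labels, weights=weights)
print(weighted_votes)

# Index of the largest total, that is, the predicted class label
print(np.argmax(weighted_votes))
```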


As we remember from the discussion on logistic regression in Chapter 3, A Tour of
Machine Learning Classifiers Using scikit-learn, certain classifiers in scikit-learn
can also return the probability of a predicted class label via the predict_proba
method. Using the predicted class probabilities instead of the class labels for
majority voting can be useful if the classifiers in our ensemble are well calibrated.
The modified version of the majority vote for predicting class labels from
probabilities can be written as follows:


$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j \, p_{ij}$$


Here, $p_{ij}$ is the predicted probability of the jth classifier for class label i.


To continue with our previous example, let's assume that we have a binary
classification problem with class labels $i \in \{0, 1\}$ and an ensemble of three
classifiers $C_j$ ($j \in \{1, 2, 3\}$). Let's assume that the classifiers $C_j$ return the
following class membership probabilities for a particular sample x:


$$C_1(x) \to [0.9, 0.1], \quad C_2(x) \to [0.8, 0.2], \quad C_3(x) \to [0.4, 0.6]$$


We can then calculate the individual class probabilities as follows:


$$p(i_0 \mid x) = 0.2 \times 0.9 + 0.2 \times 0.8 + 0.6 \times 0.4 = 0.58$$

$$p(i_1 \mid x) = 0.2 \times 0.1 + 0.2 \times 0.2 + 0.6 \times 0.6 = 0.42$$

$$\hat{y} = \arg\max_{i} \left[p(i_0 \mid x), p(i_1 \mid x)\right] = 0$$


To implement the weighted majority vote based on class probabilities, we can again
make use of NumPy using np.average and np.argmax:


>>> ex = np.array([[0.9, 0.1],
...                [0.8, 0.2],
...                [0.4, 0.6]])
>>> p = np.average(ex, axis=0, weights=[0.2, 0.2, 0.6])
>>> p
array([ 0.58,  0.42])
>>> np.argmax(p)
0


Putting everything together, let's now implement MajorityVoteClassifier in
Python:


from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
# Python 2/3 compatibility helper (no longer shipped with recent
# scikit-learn releases)
from sklearn.externals import six
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator


class MajorityVoteClassifier(BaseEstimator,
                             ClassifierMixin):
    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
        Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'}
        Default: 'classlabel'
        If 'classlabel' the prediction is based on
        the argmax of class labels. Else if
        'probability', the argmax of the sum of
        probabilities is used to predict the class label
        (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers]
        Optional, default: None
        If a list of `int` or `float` values are
        provided, the classifiers are weighted by
        importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers,
                 vote='classlabel', weights=None):

        self.classifiers = classifiers
        self.named_classifiers = {key: value for
                                  key, value in
                                  _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Matrix of training samples.

        y : array-like, shape = [n_samples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        # Use LabelEncoder to ensure class labels start
        # with 0, which is important for np.argmax
        # call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X,
                              self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self


I've added a lot of comments to the code to explain the individual parts. However, 
before we implement the remaining methods, let's take a quick break and discuss 
some of the code that may look confusing at first. We used the BaseEstimator and
ClassifierMixin parent classes to get some base functionality for free, including
the get_params and set_params methods to set and return the classifier's parameters,
as well as the score method to calculate the prediction accuracy. Also note that we
imported six to make MajorityVoteClassifier compatible with Python 2.6.


Next, we will add the predict method to predict the class label via a majority vote
based on the class labels if we initialize a new MajorityVoteClassifier object with
vote='classlabel'. Alternatively, we will be able to initialize the ensemble
classifier with vote='probability' to predict the class label based on the class
membership probabilities. Furthermore, we will also add a predict_proba method
to return the averaged probabilities, which is useful when computing the ROC AUC:


    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Matrix of training samples.

        Returns
        ----------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.

        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X),
                                 axis=1)
        else:  # 'classlabel' vote

            # Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in
                                      self.classifiers_]).T

            maj_vote = np.apply_along_axis(
                           lambda x:
                           np.argmax(np.bincount(x,
                                        weights=self.weights)),
                           axis=1,
                           arr=predictions)
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Training vectors, where n_samples is
            the number of samples and
            n_features is the number of features.

        Returns
        ----------
        avg_proba : array-like,
            shape = [n_samples, n_classes]
            Weighted average probability for
            each class per sample.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas,
                               axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier,
                         self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in\
                    six.iteritems(self.named_classifiers):
                for key, value in six.iteritems(
                        step.get_params(deep=True)):
                    out['%s__%s' % (name, key)] = value
            return out


Also, note that we defined our own modified version of the get_params method to
use the _name_estimators function to access the parameters of individual classifiers
in the ensemble; this may look a little bit complicated at first, but it will make perfect
sense when we use grid search for hyperparameter tuning in later sections.


Note 


Although the MajorityVoteClassifier implementation is very useful for
demonstration purposes, we implemented a more sophisticated version of this
majority vote classifier in scikit-learn based on the implementation in the first edition
of this book. The ensemble classifier is available as
sklearn.ensemble.VotingClassifier in scikit-learn version 0.17 and newer.
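As a brief, non-authoritative sketch of how the scikit-learn class could be used, the following block builds a weighted ensemble on the full three-class Iris dataset. In VotingClassifier, voting='hard' corresponds to voting on class labels and voting='soft' to averaging class probabilities; the estimator names ('lr', 'dt', 'knn'), the weights, and the hyperparameters are assumptions made for this illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('dt', DecisionTreeClassifier(max_depth=1, random_state=0)),
                ('knn', KNeighborsClassifier(n_neighbors=1))],
    voting='soft',            # average the predicted class probabilities
    weights=[0.2, 0.2, 0.6])  # one weight per estimator
ensemble.fit(X, y)

# Predicted labels and averaged probabilities for the first three samples
print(ensemble.predict(X[:3]))
print(ensemble.predict_proba(X[:3]).shape)
```

The predictions here are made on training samples purely to show the API; a proper evaluation would use held-out data.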
Using the majority voting principle to make predictions


Now it is about time to put the MajorityVoteClassifier that we implemented in the
previous section into action. But first, let's prepare a dataset that we can test it on.
Since we are already familiar with techniques to load datasets from CSV files, we
will take a shortcut and load the Iris dataset from scikit-learn's dataset module.
Furthermore, we will only select two features, sepal width and petal length, to
make the classification task more challenging for illustration purposes. Although our
MajorityVoteClassifier generalizes to multiclass problems, we will only classify
flower samples from the Iris-versicolor and Iris-virginica classes, with which
we will compute the ROC AUC later. The code is as follows:


>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.preprocessing import LabelEncoder
>>> iris = datasets.load_iris()
>>> X, y = iris.data[50:, [1, 2]], iris.target[50:]
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)


Note 


Note that scikit-learn uses the predict_proba method (if applicable) to compute the
ROC AUC score. In Chapter 3, A Tour of Machine Learning Classifiers Using scikit- 
learn, we saw how the class probabilities are computed in logistic regression models. 
In decision trees, the probabilities are calculated from a frequency vector that is 
created for each node at training time. The vector collects the frequency values of 
each class label computed from the class label distribution at that node. Then, the 
frequencies are normalized so that they sum up to 1. Similarly, the class labels of the 
k-nearest neighbors are aggregated to return the normalized class label frequencies in 
the k-nearest neighbors algorithm. Although the normalized probabilities returned by 
both the decision tree and k-nearest neighbors classifier may look similar to the 
probabilities obtained from a logistic regression model, we have to be aware that 
these are actually not derived from probability mass functions. 
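The frequency-based nature of these outputs is easy to observe for k-nearest neighbors. In the sketch below, which assumes the default uniform neighbor weighting and an illustrative choice of k = 5 on the full Iris dataset, every value returned by predict_proba is a neighbor count divided by k:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k = 5  # illustrative choice
knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)

# Class label frequencies among the k nearest neighbors of five samples
proba = knn.predict_proba(X[50:55])
print(proba)

# With uniform weights, each entry times k is a whole number of neighbors
print(np.allclose(proba * k, np.round(proba * k)))
```

Only k + 1 distinct values are possible per class, which is one reason such outputs should be read as normalized vote counts rather than as calibrated probabilities.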


Next, we split the Iris samples into 50 percent training and 50 percent test data: 


>>> X_train, X_test, y_train, y_test =\
...     train_test_split(X, y,
...                      test_size=0.5,
...                      random_state=1,
...                      stratify=y)


Using the training dataset, we now will train three different classifiers: 


• Logistic regression classifier
• Decision tree classifier
• k-nearest neighbors classifier


We then evaluate the model performance of each classifier via 10-fold cross- 
validation on the training dataset before we combine them into an ensemble 
classifier: 


>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> clf1 = LogisticRegression(penalty='l2',
...                           C=0.001,
...                           random_state=1)
>>> clf2 = DecisionTreeClassifier(max_depth=1,
...                               criterion='entropy',
...                               random_state=0)
>>> clf3 = KNeighborsClassifier(n_neighbors=1,
...                             p=2,
...                             metric='minkowski')
>>> pipe1 = Pipeline([['sc', StandardScaler()],
...                   ['clf', clf1]])
>>> pipe3 = Pipeline([['sc', StandardScaler()],
...                   ['clf', clf3]])
>>> clf_labels = ['Logistic regression', 'Decision tree', 'KNN']
>>> print('10-fold cross validation:\n')
>>> for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
...     scores = cross_val_score(estimator=clf,
...                              X=X_train,
...                              y=y_train,
...                              cv=10,
...                              scoring='roc_auc')
...     print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
...           % (scores.mean(), scores.std(), label))


The output that we receive, as shown in the following snippet, shows that the
predictive performances of the individual classifiers are almost equal:


10-fold cross validation: 


ROC AUC: 0.87 (+/- 0.17) [Logistic regression] 
ROC AUC: 0.89 (+/- 0.16) [Decision tree] 
ROC AUC: 0.88 (+/- 0.15) [KNN] 


You may be wondering why we trained the logistic regression and k-nearest 
neighbors classifier as part of a pipeline. The reason behind it is that, as discussed in 
Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, both the 
logistic regression and k-nearest neighbors algorithms (using the Euclidean distance 
metric) are not scale-invariant, in contrast to decision trees. Although the Iris 
features are all measured on the same scale (cm), it is a good habit to work with 
standardized features. 


Now let's move on to the more exciting part and combine the individual classifiers 
for majority rule voting in our MajorityVoteClassifier: 


>>> mv_clf = MajorityVoteClassifier(
...                 classifiers=[pipe1, clf2, pipe3])
>>> clf_labels += ['Majority voting']
>>> all_clf = [pipe1, clf2, pipe3, mv_clf]
>>> for clf, label in zip(all_clf, clf_labels):
...     scores = cross_val_score(estimator=clf,
...                              X=X_train,
...                              y=y_train,
...                              cv=10,
...                              scoring='roc_auc')
...     print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
...           % (scores.mean(), scores.std(), label))
ROC AUC: 0.87 (+/- 0.17) [Logistic regression]
ROC AUC: 0.89 (+/- 0.16) [Decision tree]
ROC AUC: 0.88 (+/- 0.15) [KNN]
ROC AUC: 0.94 (+/- 0.13) [Majority voting]


As we can see, the performance of MajorityVoteClassifier has improved over
the individual classifiers in the 10-fold cross-validation evaluation.


Evaluating and tuning the ensemble classifier


In this section, we are going to compute the ROC curves from the test set to check
that MajorityVoteClassifier generalizes well with unseen data. We should
remember that the test set is not to be used for model selection; its purpose is merely
to report an unbiased estimate of the generalization performance of a classifier
system:


>>> from sklearn.metrics import roc_curve
>>> from sklearn.metrics import auc
>>> colors = ['black', 'orange', 'blue', 'green']
>>> linestyles = [':', '--', '-.', '-']
>>> for clf, label, clr, ls \
...         in zip(all_clf, clf_labels, colors, linestyles):
...     # assuming the label of the positive class is 1
...     y_pred = clf.fit(X_train,
...                      y_train).predict_proba(X_test)[:, 1]
...     fpr, tpr, thresholds = roc_curve(y_true=y_test,
...                                      y_score=y_pred)
...     roc_auc = auc(x=fpr, y=tpr)
...     plt.plot(fpr, tpr,
...              color=clr,
...              linestyle=ls,
...              label='%s (auc = %0.2f)' % (label, roc_auc))
>>> plt.legend(loc='lower right')
>>> plt.plot([0, 1], [0, 1],
...          linestyle='--',
...          color='gray',
...          linewidth=2)
>>> plt.xlim([-0.1, 1.1])
>>> plt.ylim([-0.1, 1.1])
>>> plt.grid(alpha=0.5)
>>> plt.xlabel('False positive rate (FPR)')
>>> plt.ylabel('True positive rate (TPR)')
>>> plt.show()


As we can see in the resulting ROC, the ensemble classifier also performs well on 
the test set (ROC AUC = 0.95). However, we can see that the logistic regression 
classifier performs similarly well on the same dataset, which is probably due to the 
high variance (in this case, sensitivity to how we split the dataset) given the small
size of the dataset:


[Figure: ROC curves on the test set. Legend: Logistic regression (auc = 0.95),
Decision tree (auc = 0.90), KNN (auc = 0.86), Majority voting (auc = 0.95).
x-axis: False positive rate (FPR)]





Since we only selected two features for the classification examples, it would be
interesting to see what the decision region of the ensemble classifier actually looks
like. Although it is not necessary to standardize the training features prior to model
fitting, because our logistic regression and k-nearest neighbors pipelines will
automatically take care of it, we will standardize the training set so that the decision
regions of the decision tree will be on the same scale for visual purposes. The code is
as follows:


>>> sc = StandardScaler()
>>> X_train_std = sc.fit_transform(X_train)
>>> from itertools import product
>>> x_min = X_train_std[:, 0].min() - 1
>>> x_max = X_train_std[:, 0].max() + 1
>>> y_min = X_train_std[:, 1].min() - 1
>>> y_max = X_train_std[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
...                      np.arange(y_min, y_max, 0.1))
>>> f, axarr = plt.subplots(nrows=2, ncols=2,
...                         sharex='col',
...                         sharey='row',
...                         figsize=(7, 5))
>>> for idx, clf, tt in zip(product([0, 1], [0, 1]),
...                         all_clf, clf_labels):
...     clf.fit(X_train_std, y_train)
...     Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     Z = Z.reshape(xx.shape)
...     axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.3)
...     axarr[idx[0], idx[1]].scatter(X_train_std[y_train==0, 0],
...                                   X_train_std[y_train==0, 1],
...                                   c='blue',
...                                   marker='^',
...                                   s=50)
...     axarr[idx[0], idx[1]].scatter(X_train_std[y_train==1, 0],
...                                   X_train_std[y_train==1, 1],
...                                   c='green',
...                                   marker='o',
...                                   s=50)
...     axarr[idx[0], idx[1]].set_title(tt)
>>> plt.text(-3.5, -4.5,
...          s='Sepal width [standardized]',
...          ha='center', va='center', fontsize=12)
>>> plt.text(-10.5, 4.5,
...          s='Petal length [standardized]',
...          ha='center', va='center',
...          fontsize=12, rotation=90)
>>> plt.show()


Interestingly, but also as expected, the decision regions of the ensemble classifier
seem to be a hybrid of the decision regions from the individual classifiers. At first
glance, the majority vote decision boundary looks a lot like the decision of the
decision tree stump, which is orthogonal to the y axis for sepal width ≥ 1.
However, we also notice the non-linearity from the k-nearest neighbor classifier
mixed in:

[Figure: decision regions of the individual classifiers and the majority vote ensemble, with panels including Logistic regression and Decision tree; x-axis: Sepal width [standardized]; y-axis: Petal length [standardized]]
Before we tune the individual classifier's parameters for ensemble classification, let's
call the get_params method to get a basic idea of how we can access the individual
parameters inside a GridSearch object:


>>> mv_clf.get_params()
{'decisiontreeclassifier': DecisionTreeClassifier(class_weight=None,
     criterion='entropy', max_depth=1,
     max_features=None,
     max_leaf_nodes=None,
     min_samples_leaf=1,
     min_samples_split=2, min_weight_fraction_leaf=0.0,
     random_state=0, splitter='best'),
 'decisiontreeclassifier__class_weight': None,
 'decisiontreeclassifier__criterion': 'entropy',
 [...]
 'decisiontreeclassifier__random_state': 0,
 'decisiontreeclassifier__splitter': 'best',
 'pipeline-1': Pipeline(steps=[('sc', StandardScaler(copy=True,
     with_mean=True, with_std=True)), ('clf', LogisticRegression(C=0.001,
     class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, max_iter=100, multi_class='ovr',
     penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
     verbose=0))]),
 'pipeline-1__clf': LogisticRegression(C=0.001, class_weight=None,
     dual=False, fit_intercept=True,
     intercept_scaling=1, max_iter=100, multi_class='ovr',
     penalty='l2', random_state=0, solver='liblinear',
     tol=0.0001,
     verbose=0),
 'pipeline-1__clf__C': 0.001,
 'pipeline-1__clf__class_weight': None,
 'pipeline-1__clf__dual': False,
 [...]
 'pipeline-1__sc__with_std': True,
 'pipeline-2': Pipeline(steps=[('sc', StandardScaler(copy=True,
     with_mean=True, with_std=True)), ('clf',
     KNeighborsClassifier(algorithm='auto', leaf_size=30,
     metric='minkowski',
     metric_params=None, n_neighbors=1, p=2,
     weights='uniform'))]),
 'pipeline-2__clf': KNeighborsClassifier(algorithm='auto',
     leaf_size=30, metric='minkowski',
     metric_params=None, n_neighbors=1, p=2, weights='uniform'),
 'pipeline-2__clf__algorithm': 'auto',
 [...]
 'pipeline-2__sc__with_std': True}


Based on the values returned by the get_params method, we now know how to
access the individual classifier's attributes. Let's now tune the inverse regularization
parameter C of the logistic regression classifier and the decision tree depth via a grid
search for demonstration purposes:


>>> from sklearn.model_selection import GridSearchCV
>>> params = {'decisiontreeclassifier__max_depth': [1, 2],
...           'pipeline-1__clf__C': [0.001, 0.1, 100.0]}
>>> grid = GridSearchCV(estimator=mv_clf,
...                     param_grid=params,
...                     cv=10,
...                     scoring='roc_auc')
>>> grid.fit(X_train, y_train)


After the grid search has completed, we can print the different hyperparameter value
combinations and the average ROC AUC scores computed via 10-fold cross-validation
as follows:

>>> for params, mean_score, scores in grid.grid_scores_:
...     print("%0.3f +/- %0.2f %r"
...           % (mean_score, scores.std() / 2, params))
0.933 +/- 0.07 {'pipeline-1__clf__C': 0.001,
'decisiontreeclassifier__max_depth': 1}
0.947 +/- 0.07 {'pipeline-1__clf__C': 0.1,
'decisiontreeclassifier__max_depth': 1}
0.973 +/- 0.04 {'pipeline-1__clf__C': 100.0,
'decisiontreeclassifier__max_depth': 1}
0.947 +/- 0.07 {'pipeline-1__clf__C': 0.001,
'decisiontreeclassifier__max_depth': 2}
0.947 +/- 0.07 {'pipeline-1__clf__C': 0.1,
'decisiontreeclassifier__max_depth': 2}
0.973 +/- 0.04 {'pipeline-1__clf__C': 100.0,
'decisiontreeclassifier__max_depth': 2}

>>> print('Best parameters: %s' % grid.best_params_)
Best parameters: {'pipeline-1__clf__C': 100.0,
'decisiontreeclassifier__max_depth': 1}

>>> print('Accuracy: %.2f' % grid.best_score_)
Accuracy: 0.97


As we can see, we get the best cross-validation results when we choose a lower
regularization strength (C=100.0), whereas the tree depth does not seem to affect the
performance at all, suggesting that a decision stump is sufficient to separate the data.
To remind ourselves that it is a bad practice to use the test dataset more than once for
model evaluation, we are not going to estimate the generalization performance of the
tuned hyperparameters in this section. We will move on swiftly to an alternative
approach for ensemble learning: bagging.
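
As a side note on the grid search above, the six result rows correspond to the Cartesian product of the two parameter lists. The following is a minimal sketch of that enumeration using only the standard library; `expand_grid` is a hypothetical helper written for illustration, not part of scikit-learn, and the order in which scikit-learn visits the combinations may differ:

```python
from itertools import product

# Same grid as in the text: two tree depths and three values of C.
param_grid = {
    'decisiontreeclassifier__max_depth': [1, 2],
    'pipeline-1__clf__C': [0.001, 0.1, 100.0],
}


def expand_grid(grid):
    """Return one dict per hyperparameter combination (Cartesian product).

    Hypothetical helper for illustration only.
    """
    names = sorted(grid)  # fixed key order so the result is reproducible
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = expand_grid(param_grid)
for combo in combinations:
    print(combo)

# With cv=10, every combination is fit once per fold.
print(len(combinations), 'combinations,',
      len(combinations) * 10, 'cross-validation fits')
```

Each key names a parameter by its path through the ensemble: the estimator name, then the pipeline step, then the parameter itself, joined by double underscores.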


Note 


The majority vote approach we implemented in this section is not to be confused
with stacking. The stacking algorithm can be understood as a two-layer ensemble,
where the first layer consists of individual classifiers that feed their predictions to the
second level, where another classifier (typically logistic regression) is fit to the
level-1 classifier predictions to make the final predictions. The stacking algorithm has
been described in more detail by David H. Wolpert in Stacked generalization, Neural
Networks, 5(2):241-259, 1992.

Unfortunately, this algorithm has not been implemented in scikit-learn at the time of
writing; however, this feature is under way. In the meantime, you can find
scikit-learn-compatible implementations of stacking at
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/ and
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/.

Bagging: building an ensemble of classifiers from bootstrap samples


Bagging is an ensemble learning technique that is closely related to the
MajorityVoteClassifier that we implemented in the previous section. However,
instead of using the same training set to fit the individual classifiers in the ensemble,
we draw bootstrap samples (random samples with replacement) from the initial
training set, which is why bagging is also known as bootstrap aggregating.


The concept of bagging is summarized in the following diagram:

[Figure: bagging diagram with the stages Bootstrap samples, Classification models, Predictions, and Final prediction]


In the following subsections, we will work through a simple example of bagging by
hand and use scikit-learn for classifying wine samples.

Bagging in a nutshell


To provide a more concrete example of how the bootstrap aggregating of a
bagging classifier works, let's consider the example shown in the following figure.
Here, we have seven different training instances (denoted as indices 1-7) that are
sampled randomly with replacement in each round of bagging. Each bootstrap
sample is then used to fit a classifier $C_j$, which is most typically an unpruned
decision tree:

[Figure: table of sample indices and the indices drawn in bagging round 1, bagging round 2, ...]

As we can see from the previous illustration, each classifier receives a random subset
of samples from the training set. Each subset contains a certain portion of duplicates
and some of the original samples don't appear in a resampled dataset at all due to
sampling with replacement. Once the individual classifiers are fit to the bootstrap
samples, the predictions are combined using majority voting.
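
The sampling and voting steps described above can be sketched with the standard library alone. This is an illustrative sketch, not the book's implementation: `bootstrap_indices` and `majority_vote` are hypothetical helper names, the seed is arbitrary, and the three votes at the end are made-up predictions for a single sample:

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n indices from 1..n uniformly at random, with replacement."""
    return [rng.randint(1, n) for _ in range(n)]


def majority_vote(labels):
    """Return the most frequent label among the individual predictions."""
    return Counter(labels).most_common(1)[0][0]


rng = random.Random(1)  # arbitrary seed, only for reproducibility
n_samples = 7           # seven training instances, as in the example

for round_no in (1, 2, 3):
    sample = bootstrap_indices(n_samples, rng)
    not_drawn = sorted(set(range(1, n_samples + 1)) - set(sample))
    print('round', round_no, 'sample:', sample, 'not drawn:', not_drawn)

# Combine the (made-up) predictions of three classifiers for one sample.
print(majority_vote([1, 0, 1]))
```

Because every draw is made with replacement, a round usually repeats some indices and leaves others out, which is what makes the fitted classifiers differ from one another.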


Note that bagging is also related to the random forest classifier that we introduced in
Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn. In fact,
random forests are a special case of bagging where we also use random feature
subsets when fitting the individual decision trees.


Note 


Bagging was first proposed by Leo Breiman in a technical report in 1994; he also
showed that bagging can improve the accuracy of unstable models and decrease the
degree of overfitting. I highly recommend you read about his research in Bagging
predictors, L. Breiman, Machine Learning, 24(2):123-140, 1996, which is freely
available online, to learn more details about bagging.

Applying bagging to classify samples in the Wine dataset


To see bagging in action, let's create a more complex classification problem using
the Wine dataset that we introduced in Chapter 4, Building Good Training Sets:
Data Preprocessing. Here, we will only consider the Wine classes 2 and 3, and we
select two features: Alcohol and OD280/OD315 of diluted wines:


>>> import pandas as pd
>>> df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
...               'machine-learning-databases/wine/wine.data',
...               header=None)
>>> df_wine.columns = ['Class label', 'Alcohol',
...                    'Malic acid', 'Ash',
...                    'Alcalinity of ash',
...                    'Magnesium', 'Total phenols',
...                    'Flavanoids', 'Nonflavanoid phenols',
...                    'Proanthocyanins',
...                    'Color intensity', 'Hue',
...                    'OD280/OD315 of diluted wines',
...                    'Proline']
>>> # drop 1 class
>>> df_wine = df_wine[df_wine['Class label'] != 1]
>>> y = df_wine['Class label'].values
>>> X = df_wine[['Alcohol',
...              'OD280/OD315 of diluted wines']].values


Next, we encode the class labels into binary format and split the dataset into 80 
percent training and 20 percent test sets, respectively: 


>>> from sklearn.preprocessing import LabelEncoder
>>> from sklearn.model_selection import train_test_split
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)
>>> X_train, X_test, y_train, y_test =\
...            train_test_split(X, y,
...                             test_size=0.2,
...                             random_state=1,
...                             stratify=y)


Note 


You can find a copy of the Wine dataset (and all other datasets used in this book) in
the code bundle of this book, which you can use if you are working offline or the
UCI server at https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
is temporarily unavailable. For instance, to load the Wine dataset from a local
directory, take these lines:

df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                 'machine-learning-databases'
                 '/wine/wine.data',
                 header=None)

Replace them with this:

df = pd.read_csv('your/local/path/to/wine.data',
                 header=None)


A BaggingClassifier algorithm is already implemented in scikit-learn, which we
can import from the ensemble submodule. Here, we will use an unpruned decision
tree as the base classifier and create an ensemble of 500 decision trees fit on different
bootstrap samples of the training dataset:


>>> from sklearn.ensemble import BaggingClassifier
>>> tree = DecisionTreeClassifier(criterion='entropy',
...                               random_state=1,
...                               max_depth=None)
>>> bag = BaggingClassifier(base_estimator=tree,
...                         n_estimators=500,
...                         max_samples=1.0,
...                         max_features=1.0,
...                         bootstrap=True,
...                         bootstrap_features=False,
...                         n_jobs=1,
...                         random_state=1)


Next, we will calculate the accuracy score of the prediction on the training and test 
dataset to compare the performance of the bagging classifier to the performance of a 
single unpruned decision tree: 


>>> from sklearn.metrics import accuracy_score
>>> tree = tree.fit(X_train, y_train)
>>> y_train_pred = tree.predict(X_train)
>>> y_test_pred = tree.predict(X_test)
>>> tree_train = accuracy_score(y_train, y_train_pred)
>>> tree_test = accuracy_score(y_test, y_test_pred)
>>> print('Decision tree train/test accuracies %.3f/%.3f'
...       % (tree_train, tree_test))
Decision tree train/test accuracies 1.000/0.833


Based on the accuracy values that we printed here, the unpruned decision tree 
predicts all the class labels of the training samples correctly; however, the 
substantially lower test accuracy indicates high variance (overfitting) of the model: 


>>> bag = bag.fit(X_train, y_train)
>>> y_train_pred = bag.predict(X_train)
>>> y_test_pred = bag.predict(X_test)
>>> bag_train = accuracy_score(y_train, y_train_pred)
>>> bag_test = accuracy_score(y_test, y_test_pred)
>>> print('Bagging train/test accuracies %.3f/%.3f'
...       % (bag_train, bag_test))
Bagging train/test accuracies 1.000/0.917


Although the decision tree and the bagging classifier reach the same accuracy on the
training set (both 100 percent), we can see that the bagging classifier
has a slightly better generalization performance, as estimated on the test set. Next,
let's compare the decision regions between the decision tree and the bagging
classifier:


>>> x_min = X_train[:, 0].min() - 1
>>> x_max = X_train[:, 0].max() + 1
>>> y_min = X_train[:, 1].min() - 1
>>> y_max = X_train[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
...                      np.arange(y_min, y_max, 0.1))
>>> f, axarr = plt.subplots(nrows=1, ncols=2,
...                         sharex='col',
...                         sharey='row',
...                         figsize=(8, 3))
>>> for idx, clf, tt in zip([0, 1],
...                         [tree, bag],
...                         ['Decision tree', 'Bagging']):
...     clf.fit(X_train, y_train)
...
...     Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     Z = Z.reshape(xx.shape)
...     axarr[idx].contourf(xx, yy, Z, alpha=0.3)
...     axarr[idx].scatter(X_train[y_train==0, 0],
...                        X_train[y_train==0, 1],
...                        c='blue', marker='^')
...     axarr[idx].scatter(X_train[y_train==1, 0],
...                        X_train[y_train==1, 1],
...                        c='green', marker='o')
...     axarr[idx].set_title(tt)
>>> axarr[0].set_ylabel('Alcohol', fontsize=12)
>>> plt.text(10.2, -1.2,
...          s='OD280/OD315 of diluted wines',
...          ha='center', va='center', fontsize=12)
>>> plt.show()


As we can see in the resulting plot, the piece-wise linear decision boundary of the
three-node deep decision tree looks smoother in the bagging ensemble:

[Figure: decision regions in two panels, Decision tree and Bagging; axis labels: Alcohol and OD280/OD315 of diluted wines]
We only looked at a very simple bagging example in this section. In practice, more 
complex classification tasks and a dataset's high dimensionality can easily lead to 
overfitting in single decision trees, and this is where the bagging algorithm can really 
play to its strengths. Finally, we shall note that the bagging algorithm can be an 
effective approach to reduce the variance of a model. However, bagging is 
ineffective in reducing model bias, that is, models that are too simple to capture the 
trend in the data well. This is why we want to perform bagging on an ensemble of 
classifiers with low bias, for example, unpruned decision trees.

Leveraging weak learners via adaptive boosting


In this last section about ensemble methods, we will discuss boosting with a special 
focus on its most common implementation, AdaBoost (Adaptive Boosting). 


Note 


The original idea behind AdaBoost was formulated by Robert E. Schapire in 1990
(The Strength of Weak Learnability, R. E. Schapire, Machine Learning, 5(2):197-227,
1990). After Robert Schapire and Yoav Freund presented the AdaBoost
algorithm in the Proceedings of the Thirteenth International Conference (ICML
1996), AdaBoost became one of the most widely used ensemble methods in the years
that followed (Experiments with a New Boosting Algorithm by Y. Freund, R. E.
Schapire, and others, ICML, volume 96, 148-156, 1996). In 2003, Freund and
Schapire received the Gödel Prize for their groundbreaking work, which is a
prestigious prize for the most outstanding publications in the field of computer
science.


In boosting, the ensemble consists of very simple base classifiers, also often referred 
to as weak learners, which often only have a slight performance advantage over 
random guessing—a typical example of a weak learner is a decision tree stump. The 
key concept behind boosting is to focus on training samples that are hard to classify, 
that is, to let the weak learners subsequently learn from misclassified training 
samples to improve the performance of the ensemble. 


The following subsections will introduce the algorithmic procedure behind the
general concept of boosting and a popular variant called AdaBoost. Lastly, we will use
scikit-learn for a practical classification example.

How boosting works


In contrast to bagging, the initial formulation of the boosting algorithm uses random
subsets of training samples drawn from the training dataset without replacement; the
original boosting procedure is summarized in the following four key steps:

1. Draw a random subset of training samples $d_1$ without replacement from
   training set $D$ to train a weak learner $C_1$.
2. Draw a second random training subset $d_2$ without replacement from the
   training set and add 50 percent of the samples that were previously
   misclassified to train a weak learner $C_2$.
3. Find the training samples $d_3$ in training set $D$, which $C_1$ and $C_2$ disagree
   upon, to train a third weak learner $C_3$.
4. Combine the weak learners $C_1$, $C_2$, and $C_3$ via majority voting.
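
The four steps can be sketched in plain Python for a one-dimensional toy problem. Everything here is an assumption made for illustration: the toy data, the subset size, the seed, the `fit_stump` weak learner, the reading of step 2 as "half of the samples that $C_1$ misclassifies", and the fallback to the full training set when $C_1$ and $C_2$ never disagree:

```python
import random


def fit_stump(xs, ys):
    """Fit a 1-D decision stump (threshold and polarity) with the fewest errors."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            errors = sum(1 for x, y in zip(xs, ys)
                         if (polarity if x <= threshold else -polarity) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, polarity)
    _, threshold, polarity = best
    return lambda x: polarity if x <= threshold else -polarity


def subset(xs, ys, indices):
    """Select the samples at the given positions."""
    return [xs[i] for i in indices], [ys[i] for i in indices]


def original_boosting(xs, ys, subset_size, rng):
    all_idx = list(range(len(xs)))
    # Step 1: random subset d1 without replacement, used to train C1.
    d1 = rng.sample(all_idx, subset_size)
    c1 = fit_stump(*subset(xs, ys, d1))
    # Step 2: second random subset plus half of the samples C1 gets wrong.
    wrong = [i for i in all_idx if c1(xs[i]) != ys[i]]
    d2 = rng.sample(all_idx, subset_size) + rng.sample(wrong, len(wrong) // 2)
    c2 = fit_stump(*subset(xs, ys, d2))
    # Step 3: samples on which C1 and C2 disagree (assumed fallback: all samples).
    d3 = [i for i in all_idx if c1(xs[i]) != c2(xs[i])] or all_idx
    c3 = fit_stump(*subset(xs, ys, d3))
    # Step 4: majority vote; three votes of +1/-1 can never sum to zero.
    return lambda x: 1 if c1(x) + c2(x) + c3(x) > 0 else -1


# Hypothetical toy data with labels in {1, -1}.
xs = list(range(1, 11))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = original_boosting(xs, ys, subset_size=6, rng=random.Random(1))
predictions = [model(x) for x in xs]
accuracy = sum(p == y for p, y in zip(predictions, ys)) / len(ys)
print(predictions, accuracy)
```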


As discussed by Leo Breiman (Bias, variance, and arcing classifiers, L. Breiman,
1996), boosting can lead to a decrease in bias as well as variance compared to
bagging models. In practice, however, boosting algorithms such as AdaBoost are
also known for their high variance, that is, the tendency to overfit the training data
(An improvement of AdaBoost to avoid overfitting, G. Raetsch, T. Onoda, and K. R.
Mueller. Proceedings of the International Conference on Neural Information
Processing, CiteSeer, 1998).


In contrast to the original boosting procedure as described here, AdaBoost uses the 
complete training set to train the weak learners where the training samples are 
reweighted in each iteration to build a strong classifier that learns from the mistakes 
of the previous weak learners in the ensemble. Before we dive deeper into the 
specific details of the AdaBoost algorithm, let's take a look at the following figure to 
get a better grasp of the basic concept behind AdaBoost:

To walk through the AdaBoost illustration step by step, we start with subfigure 1,
which represents a training set for binary classification where all training samples are
assigned equal weights. Based on this training set, we train a decision stump (shown
as a dashed line) that tries to classify the samples of the two classes (triangles and
circles) as well as possible by minimizing the cost function (or the impurity score in
the special case of decision tree ensembles).


For the next round (subfigure 2), we assign a larger weight to the two previously
misclassified samples (circles). Furthermore, we lower the weight of the correctly
classified samples. The next decision stump will now be more focused on the
training samples that have the largest weights, that is, the training samples that are
supposedly hard to classify. The weak learner shown in subfigure 2 misclassifies
three different samples from the circle class, which are then assigned a larger weight,
as shown in subfigure 3.


Assuming that our AdaBoost ensemble only consists of three rounds of boosting, we 
would then combine the three weak learners trained on different reweighted training 
subsets by a weighted majority vote, as shown in subfigure 4. 


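Now that we have a better understanding of the basic concept of AdaBoost, let's
take a more detailed look at the algorithm using pseudo code. For clarity, we will
denote element-wise multiplication by the cross symbol ($\times$) and the dot-product
between two vectors by a dot symbol ($\cdot$):

1. Set the weight vector $\boldsymbol{w}$ to uniform weights, where $\sum_i w_i = 1$.
2. For $j$ in $m$ boosting rounds, do the following:
   a. Train a weighted weak learner: $C_j = \text{train}(\boldsymbol{X}, \boldsymbol{y}, \boldsymbol{w})$.
   b. Predict class labels: $\hat{\boldsymbol{y}} = \text{predict}(C_j, \boldsymbol{X})$.
   c. Compute weighted error rate: $\varepsilon = \boldsymbol{w} \cdot (\hat{\boldsymbol{y}} \neq \boldsymbol{y})$.
   d. Compute coefficient: $\alpha_j = 0.5 \log \frac{1 - \varepsilon}{\varepsilon}$.
   e. Update weights: $\boldsymbol{w} := \boldsymbol{w} \times \exp(-\alpha_j \times \hat{\boldsymbol{y}} \times \boldsymbol{y})$.
   f. Normalize weights to sum to 1: $\boldsymbol{w} := \boldsymbol{w} / \sum_i w_i$.
3. Compute the final prediction: $\hat{\boldsymbol{y}} = \left( \sum_{j=1}^{m} \left( \alpha_j \times \text{predict}(C_j, \boldsymbol{X}) \right) > 0 \right)$.

Note that the expression $(\hat{\boldsymbol{y}} \neq \boldsymbol{y})$ in step 2c refers to a binary vector consisting of 1s
and 0s, where a 1 is assigned if the prediction is incorrect and 0 is assigned
otherwise.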
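
The pseudo code translates almost line by line into Python. The sketch below uses plain lists and a one-dimensional decision stump as the weak learner; the function names, the toy data, and the small clamp on the error rate (which avoids a division by zero and is not part of the pseudo code) are assumptions made for this illustration. Step 3 maps a positive weighted sum to the label 1 and everything else to -1:

```python
import math


def fit_weighted_stump(xs, ys, w):
    """Weak learner: 1-D stump (threshold, polarity) minimizing the weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best[1], best[2]


def stump_predict(stump, xs):
    threshold, polarity = stump
    return [polarity if x <= threshold else -polarity for x in xs]


def adaboost_fit(xs, ys, m):
    n = len(xs)
    w = [1.0 / n] * n                                    # step 1
    ensemble = []
    for _ in range(m):                                   # step 2
        stump = fit_weighted_stump(xs, ys, w)            # 2a
        y_hat = stump_predict(stump, xs)                 # 2b
        eps = sum(wi for wi, p, y in zip(w, y_hat, ys) if p != y)  # 2c
        eps = min(max(eps, 1e-10), 1 - 1e-10)            # guard, not in the pseudo code
        alpha = 0.5 * math.log((1 - eps) / eps)          # 2d
        w = [wi * math.exp(-alpha * p * y)               # 2e
             for wi, p, y in zip(w, y_hat, ys)]
        total = sum(w)
        w = [wi / total for wi in w]                     # 2f
        ensemble.append((alpha, stump))
    return ensemble


def adaboost_predict(ensemble, xs):
    scores = [0.0] * len(xs)
    for alpha, stump in ensemble:
        for i, p in enumerate(stump_predict(stump, xs)):
            scores[i] += alpha * p
    return [1 if s > 0 else -1 for s in scores]          # step 3


# Hypothetical toy data with labels in {1, -1}.
xs = list(range(1, 11))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost_fit(xs, ys, m=3)
print(adaboost_predict(ensemble, xs))
```

On this toy data, no single stump classifies all ten samples correctly, while the weighted vote of three stumps does.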


Although the AdaBoost algorithm seems to be pretty straightforward, let's walk
through a more concrete example using a training set consisting of 10 training
samples, as illustrated in the following table:

[Table: Sample indices | x | y | Weights | ŷ(x <= 3.0)? | Correct? | Updated weights]

The first column of the table depicts the sample indices of training samples 1 to 10.
In the second column, we see the feature values of the individual samples, assuming
this is a one-dimensional dataset. The third column shows the true class label, $y_i$,
for each training sample $x_i$, where $y_i \in \{1, -1\}$. The initial weights are shown in
the fourth column; we initialize the weights uniformly (assigning the same constant
value) and normalize them to sum to one. In the case of the 10-sample training set,
we therefore assign 0.1 to each weight $w_i$ in the weight vector $\boldsymbol{w}$. The predicted
class labels $\hat{y}$ are shown in the fifth column, assuming that our splitting criterion is
$x \leq 3.0$. The last column of the table then shows the updated weights based on the
update rules that we defined in the pseudo code.


Since the computation of the weight updates may look a little bit complicated at first,
we will now follow the calculation step by step. We start by computing the weighted
error rate $\varepsilon$ as described in step 2c:

$$\varepsilon = 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 1 + 0.1 \times 1 + 0.1 \times 1 + 0.1 \times 0 = \frac{3}{10} = 0.3$$


Next, we compute the coefficient $\alpha_j$ (shown in step 2d), which is later used in step
2e to update the weights, as well as for the weights in the majority vote prediction
(step 3):

$$\alpha_j = 0.5 \log\left(\frac{1 - \varepsilon}{\varepsilon}\right) \approx 0.424$$
After we have computed the coefficient $\alpha_j$, we can now update the weight vector
using the following equation:

$$\boldsymbol{w} := \boldsymbol{w} \times \exp(-\alpha_j \times \hat{\boldsymbol{y}} \times \boldsymbol{y})$$

Here, $\hat{\boldsymbol{y}} \times \boldsymbol{y}$ is an element-wise multiplication between the vectors of the predicted
and true class labels, respectively. Thus, if a prediction $\hat{y}_i$ is correct, $\hat{y}_i \times y_i$ will
have a positive sign so that we decrease the ith weight, since $\alpha_j$ is a positive
number as well:

$$0.1 \times \exp(-0.424 \times 1 \times 1) \approx 0.065$$

Similarly, we will increase the ith weight if $\hat{y}_i$ predicted the label incorrectly, like
this:

$$0.1 \times \exp(-0.424 \times 1 \times (-1)) \approx 0.153$$

Alternatively, it's like this:

$$0.1 \times \exp(-0.424 \times (-1) \times (1)) \approx 0.153$$


After we have updated each weight in the weight vector, we normalize the weights
so that they sum up to one (step 2f):

$$\boldsymbol{w} := \frac{\boldsymbol{w}}{\sum_i w_i}$$

Here, $\sum_i w_i = 7 \times 0.065 + 3 \times 0.153 = 0.914$.

Thus, each weight that corresponds to a correctly classified sample will be reduced
from the initial value of 0.1 to $0.065 / 0.914 \approx 0.071$ for the next round of
boosting. Similarly, the weights of the incorrectly classified samples will increase
from 0.1 to $0.153 / 0.914 \approx 0.167$.
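
The arithmetic of this walkthrough can be checked with a few lines of Python. This is only a verification sketch; the list `incorrect` encodes the assumption, taken from the error-rate sum above, that the seventh to ninth samples are the three misclassified ones:

```python
import math

w = [0.1] * 10
# 1 marks a misclassified sample, 0 a correctly classified one.
incorrect = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Step 2c: weighted error rate as a dot product.
eps = sum(wi * e for wi, e in zip(w, incorrect))
# Step 2d: coefficient of the weak learner.
alpha = 0.5 * math.log((1 - eps) / eps)
# Step 2e: y_hat * y is +1 for a correct and -1 for an incorrect prediction.
updated = [wi * math.exp(-alpha * (-1 if e else 1))
           for wi, e in zip(w, incorrect)]
# Step 2f: normalize so that the weights sum to one.
total = sum(updated)
normalized = [u / total for u in updated]

print('eps   = %.3f' % eps)
print('alpha = %.3f' % alpha)
print('updated weights (correct/incorrect): %.3f / %.3f'
      % (updated[0], updated[6]))
print('normalized weights (correct/incorrect): %.3f / %.3f'
      % (normalized[0], normalized[6]))
```

Working with unrounded intermediate values gives a normalization constant of about 0.917 instead of 0.914; the normalized weights still round to 0.071 and 0.167.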
Applying AdaBoost using scikit-learn


The previous subsection introduced AdaBoost in a nutshell. Skipping to the more
practical part, let's now train an AdaBoost ensemble classifier via scikit-learn. We
will use the same Wine subset that we used in the previous section to train the
bagging meta-classifier. Via the base_estimator attribute, we will train the
AdaBoostClassifier on 500 decision tree stumps:


>>> from sklearn.ensemble import AdaBoostClassifier
>>> tree = DecisionTreeClassifier(criterion='entropy',
...                               random_state=1,
...                               max_depth=1)
>>> ada = AdaBoostClassifier(base_estimator=tree,
...                          n_estimators=500,
...                          learning_rate=0.1,
...                          random_state=1)
>>> tree = tree.fit(X_train, y_train)
>>> y_train_pred = tree.predict(X_train)
>>> y_test_pred = tree.predict(X_test)
>>> tree_train = accuracy_score(y_train, y_train_pred)
>>> tree_test = accuracy_score(y_test, y_test_pred)
>>> print('Decision tree train/test accuracies %.3f/%.3f'
...       % (tree_train, tree_test))
Decision tree train/test accuracies 0.916/0.875


As we can see, the decision tree stump seems to underfit the training data in contrast 
to the unpruned decision tree that we saw in the previous section: 


>>> ada = ada.fit(X_train, y_train)
>>> y_train_pred = ada.predict(X_train)
>>> y_test_pred = ada.predict(X_test)
>>> ada_train = accuracy_score(y_train, y_train_pred)
>>> ada_test = accuracy_score(y_test, y_test_pred)
>>> print('AdaBoost train/test accuracies %.3f/%.3f'
...       % (ada_train, ada_test))
AdaBoost train/test accuracies 1.000/0.917


As we can see, the AdaBoost model predicts all class labels of the training set 
correctly and also shows a slightly improved test set performance compared to the 
decision tree stump. However, we also see that we introduced additional variance by 
our attempt to reduce the model bias—a higher gap between training and test 
performance. 


Although we used another simple example for demonstration purposes, we can see
that the performance of the AdaBoost classifier is slightly improved compared to the
decision stump and that it achieved accuracy scores very similar to those of the bagging
classifier that we trained in the previous section. However, we shall note that it is
considered bad practice to select a model based on the repeated usage of the test set.
The estimate of the generalization performance may be over-optimistic, which we
discussed in more detail in Chapter 6, Learning Best Practices for Model Evaluation
and Hyperparameter Tuning.


Lastly, let us check what the decision regions look like: 


>>> x_min = X_train[:, 0].min() - 1
>>> x_max = X_train[:, 0].max() + 1
>>> y_min = X_train[:, 1].min() - 1
>>> y_max = X_train[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
...                      np.arange(y_min, y_max, 0.1))
>>> f, axarr = plt.subplots(1, 2,
...                         sharex='col',
...                         sharey='row',
...                         figsize=(8, 3))
>>> for idx, clf, tt in zip([0, 1],
...                         [tree, ada],
...                         ['Decision Tree', 'AdaBoost']):
...     clf.fit(X_train, y_train)
...     Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     Z = Z.reshape(xx.shape)
...     axarr[idx].contourf(xx, yy, Z, alpha=0.3)
...     axarr[idx].scatter(X_train[y_train==0, 0],
...                        X_train[y_train==0, 1],
...                        c='blue',
...                        marker='^')
...     axarr[idx].scatter(X_train[y_train==1, 0],
...                        X_train[y_train==1, 1],
...                        c='red',
...                        marker='o')
...     axarr[idx].set_title(tt)
>>> axarr[0].set_ylabel('Alcohol', fontsize=12)
>>> plt.text(10.2, -0.5,
...          s='OD280/OD315 of diluted wines',
...          ha='center',
...          va='center',
...          fontsize=12)
>>> plt.show()


By looking at the decision regions, we can see that the decision boundary of the
AdaBoost model is substantially more complex than the decision boundary of the
decision stump. In addition, we note that the AdaBoost model separates the feature
space very similarly to the bagging classifier that we trained in the previous section:

[Figure: decision regions in two panels, Decision Tree and AdaBoost; axis labels: Alcohol and OD280/OD315 of diluted wines]

As concluding remarks about ensemble techniques, it is worth noting that ensemble
learning increases the computational complexity compared to individual classifiers.
In practice, we need to think carefully about whether we want to pay the price of
increased computational costs for an often relatively modest improvement in
predictive performance.


An often-cited example of this trade-off is the famous $1 million Netflix Prize, which
was won using ensemble techniques. The details about the algorithm were published
in The BigChaos Solution to the Netflix Grand Prize by A. Toescher, M. Jahrer, and
R. M. Bell, Netflix prize documentation, 2009, which is available online as a PDF.
The winning team received the $1 million grand prize money; however, Netflix never
implemented their model due to its complexity, which made it infeasible for a
real-world application:

"We evaluated some of the new methods offline but the additional accuracy gains
that we measured did not seem to justify the engineering effort needed to bring them
into a production environment."
(http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html).

Summary


In this chapter, we looked at some of the most popular and widely used techniques 
for ensemble learning. Ensemble methods combine different classification models to 
cancel out their individual weaknesses, which often results in stable and well- 
performing models that are very attractive for industrial applications as well as 
machine learning competitions. 


At the beginning of this chapter, we implemented MajorityVoteClassifier in
Python, which allows us to combine different algorithms for classification. We then 
looked at bagging, a useful technique to reduce the variance of a model by drawing 
random bootstrap samples from the training set and combining the individually 
trained classifiers via majority vote. Lastly, we learned about AdaBoost, which is an 
algorithm that is based on weak learners that subsequently learn from mistakes. 


Throughout the previous chapters, we learned a lot about different learning 
algorithms, tuning, and evaluation techniques. In the next chapter, we will look at a 
particular application of machine learning, sentiment analysis, which has become an 
interesting topic in the internet and social media era.

Chapter 8. Applying Machine Learning to Sentiment Analysis


In this internet and social media age, people's opinions, reviews, and 
recommendations have become a valuable resource for political science and 
businesses. Thanks to modern technologies, we are now able to collect and analyze 
such data most efficiently. In this chapter, we will delve into a subfield of Natural 
Language Processing (NLP) called sentiment analysis and learn how to use 
machine learning algorithms to classify documents based on their polarity: the 
attitude of the writer. In particular, we are going to work with a dataset of 50,000 
movie reviews from the Internet Movie Database (IMDb) and build a predictor 
that can distinguish between positive and negative reviews. 


The topics that we will cover in the following sections include the following: 


• Cleaning and preparing text data
• Building feature vectors from text documents
• Training a machine learning model to classify positive and negative movie
  reviews
• Working with large text datasets using out-of-core learning
• Inferring topics from document collections for categorization

Preparing the IMDb movie review data for text processing


Sentiment analysis, sometimes also called opinion mining, is a popular
subdiscipline of the broader field of NLP; it is concerned with analyzing the polarity
of documents. A popular task in sentiment analysis is the classification of documents
based on the expressed opinions or emotions of the authors with regard to a
particular topic.


In this chapter, we will be working with a large dataset of movie reviews from the
IMDb that has been collected by Maas and others (Learning Word Vectors for
Sentiment Analysis, A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C.
Potts, Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon,
USA, Association for Computational Linguistics, June 2011). The movie review
dataset consists of 50,000 polar movie reviews that are labeled as either positive or
negative; here, positive means that a movie was rated with more than six stars on
IMDb, and negative means that a movie was rated with fewer than five stars on
IMDb. In the following sections, we will download the dataset, preprocess it into a
usable format for machine learning tools, and extract meaningful information from
a subset of these movie reviews to build a machine learning model that can predict
whether a certain reviewer liked or disliked a movie.
Obtaining the movie review dataset


A compressed archive of the movie review dataset (84.1 MB) can be downloaded 
from http://ai.stanford.edu/~amaas/data/sentiment/ as a Gzip-compressed tarball 
archive: 


• If you are working with Linux or macOS, you can open a new Terminal
window, cd into the download directory, and execute tar -zxf
aclImdb_v1.tar.gz to decompress the dataset.

• If you are working with Windows, you can download a free archiver such as
7Zip (http://www.7-zip.org) to extract the files from the download archive.

• Alternatively, you can unpack the Gzip-compressed tarball archive
directly in Python as follows:


>>> import tarfile
>>> with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
...     tar.extractall()
Preprocessing the movie dataset into a more convenient format


Having successfully extracted the dataset, we will now assemble the individual text
documents from the decompressed download archive into a single CSV file. In the
following code section, we will be reading the movie reviews into a pandas
DataFrame object, which can take up to 10 minutes on a standard desktop computer.
To visualize the progress and estimated time until completion, we will use the
Python Progress Indicator (PyPrind, https://pypi.python.org/pypi/PyPrind/)
package that I developed several years ago for such purposes. PyPrind can be
installed by executing the pip install pyprind command.


>>> import pyprind
>>> import pandas as pd
>>> import os
>>> # change the `basepath` to the directory of the
>>> # unzipped movie dataset
>>> basepath = 'aclImdb'
>>>
>>> labels = {'pos': 1, 'neg': 0}
>>> pbar = pyprind.ProgBar(50000)
>>> df = pd.DataFrame()
>>> for s in ('test', 'train'):
...     for l in ('pos', 'neg'):
...         path = os.path.join(basepath, s, l)
...         for file in os.listdir(path):
...             with open(os.path.join(path, file),
...                       'r', encoding='utf-8') as infile:
...                 txt = infile.read()
...             df = df.append([[txt, labels[l]]],
...                            ignore_index=True)
...             pbar.update()
>>> df.columns = ['review', 'sentiment']
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:03:37


In the preceding code, we first initialized a new progress bar object pbar with 50,000
iterations, which is the number of documents we were going to read in. Using the
nested for loops, we iterated over the train and test subdirectories in the main
aclImdb directory and read the individual text files from the pos and neg
subdirectories that we eventually appended to the df pandas DataFrame, together
with an integer class label (1 = positive and 0 = negative).
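Appending to a DataFrame inside a loop creates a new object on every call, and
DataFrame.append is no longer available in pandas 2.0 and later. The following is a
minimal sketch of an alternative, assuming the same aclImdb/{train,test}/{pos,neg}
directory layout described above: the rows are collected in a plain Python list and
written out once with the standard csv module. The helper names collect_reviews
and write_reviews_csv are hypothetical and only serve this illustration.

```python
import csv
import os

LABELS = {'pos': 1, 'neg': 0}


def collect_reviews(basepath):
    """Return a list of (review_text, label) tuples found under basepath."""
    rows = []
    for split in ('test', 'train'):
        for name, label in LABELS.items():
            path = os.path.join(basepath, split, name)
            if not os.path.isdir(path):  # tolerate a partially extracted archive
                continue
            for filename in sorted(os.listdir(path)):
                with open(os.path.join(path, filename),
                          'r', encoding='utf-8') as infile:
                    rows.append((infile.read(), label))
    return rows


def write_reviews_csv(rows, csv_path):
    """Write the collected rows to a CSV file with a header line."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(['review', 'sentiment'])
        writer.writerows(rows)
```

If pandas is available, the same list can also be turned into a DataFrame in a single
step via pd.DataFrame(rows, columns=['review', 'sentiment']).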


Since the class labels in the assembled dataset are sorted, we will now shuffle the
DataFrame using the permutation function from the np.random submodule. This
will be useful for splitting the dataset into training and test sets in later sections,
when we will stream the data from our local drive directly. For our own convenience,
we will also store the assembled and shuffled movie review dataset as a CSV file:


>>> import numpy as np
>>> np.random.seed(0)
>>> df = df.reindex(np.random.permutation(df.index))
>>> df.to_csv('movie_data.csv', index=False, encoding='utf-8')


Since we are going to use this dataset later in this chapter, let's quickly confirm that
we have successfully saved the data in the right format by reading in the CSV and
printing an excerpt of the first three samples:


>>> df = pd.read_csv('movie_data.csv', encoding='utf-8')
>>> df.head(3)


If you are running the code examples in a Jupyter Notebook, you should now see the
first three samples of the dataset, as shown in the following table:


   review                                               sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...
1  OK... so... I really like Kris Kristofferson a...
2  ***SPOILER*** Do not read this, if you think a...
Introducing the bag-of-words model


You may remember from Chapter 4, Building Good Training Sets – Data
Preprocessing, that we have to convert categorical data, such as text or words, into a
numerical form before we can pass it on to a machine learning algorithm. In this
section, we will introduce the bag-of-words model, which allows us to represent text
as numerical feature vectors. The idea behind the bag-of-words model is quite simple
and can be summarized as follows:


1. We create a vocabulary of unique tokens—for example, words—from the entire 
set of documents. 

2. We construct a feature vector from each document that contains the counts of 
how often each word occurs in the particular document. 


Since the unique words in each document represent only a small subset of all the
words in the bag-of-words vocabulary, the feature vectors will mostly consist of
zeros, which is why we call them sparse. Do not worry if this sounds too abstract; in
the following subsections, we will walk through the process of creating a simple
bag-of-words model step-by-step.




Transforming words into feature vectors 


To construct a bag-of-words model based on the word counts in the respective
documents, we can use the CountVectorizer class implemented in scikit-learn. As
we will see in the following code section, CountVectorizer takes an array of text
data, which can be documents or sentences, and constructs the bag-of-words model
for us:


>>> import numpy as np
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count = CountVectorizer()
>>> docs = np.array([
...        'The sun is shining',
...        'The weather is sweet',
...        'The sun is shining, the weather is sweet, and one and one is two'])
>>> bag = count.fit_transform(docs)


By calling the fit_transform method on CountVectorizer, we constructed the
vocabulary of the bag-of-words model and transformed the following three sentences
into sparse feature vectors:


• 'The sun is shining'

• 'The weather is sweet'

• 'The sun is shining, the weather is sweet, and one and one is two'


Now let's print the contents of the vocabulary to get a better understanding of the 
underlying concepts: 


>>> print(count.vocabulary_)
{'and': 0,
 'two': 7,
 'shining': 3,
 'one': 2,
 'sun': 4,
 'weather': 8,
 'the': 6,
 'sweet': 5,
 'is': 1}


As we can see from executing the preceding command, the vocabulary is stored in a 
Python dictionary that maps the unique words to integer indices. Next, let's print the 
feature vectors that we just created: 
>>> print(bag.toarray())
[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


Each index position in the feature vectors shown here corresponds to the integer
values that are stored as dictionary items in the CountVectorizer vocabulary. For
example, the first feature at index position 0 represents the count of the word 'and',
which only occurs in the last document, and the word 'is', at index position 1 (the
second feature in the document vectors), occurs in all three sentences. These values
in the feature vectors are also called the raw term frequencies: tf(t, d), the
number of times a term t occurs in a document d.
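To make the two steps of the bag-of-words model concrete, here is a minimal sketch
in plain Python that rebuilds the vocabulary and the count vectors for the three
example sentences. It is not how CountVectorizer is implemented; the tokenize
helper is an assumption chosen to mimic its default behavior (lowercasing and
keeping tokens of at least two word characters), and it returns dense lists rather
than a sparse matrix.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']


def tokenize(text):
    # lowercase the text and keep tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())


# Step 1: vocabulary of unique tokens, indexed in alphabetical order
unique_tokens = sorted({token for doc in docs for token in tokenize(doc)})
vocabulary = {token: index for index, token in enumerate(unique_tokens)}


# Step 2: one vector of raw term frequencies per document
def count_vector(text):
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]


bag = [count_vector(doc) for doc in docs]
print(vocabulary)
print(bag)
```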


Note


The sequence of items in the bag-of-words model that we just created is also called
the 1-gram or unigram model: each item or token in the vocabulary represents a
single word. More generally, the contiguous sequences of items in NLP (words,
letters, or symbols) are also called n-grams. The choice of the number n in the
n-gram model depends on the particular application; for example, a study by Kanaris
and others revealed that n-grams of size 3 and 4 yield good performances in
anti-spam filtering of email messages (Words versus character n-grams for anti-spam
filtering, Ioannis Kanaris, Konstantinos Kanaris, Ioannis Houvardas, and Efstathios
Stamatatos, International Journal on Artificial Intelligence Tools, World Scientific
Publishing Company, 16(06): 1047-1067, 2007). To summarize the concept of the
n-gram representation, the 1-gram and 2-gram representations of our first document
"the sun is shining" would be constructed as follows:


• 1-gram: "the", "sun", "is", "shining"

• 2-gram: "the sun", "sun is", "is shining"


The CountVectorizer class in scikit-learn allows us to use different n-gram models
via its ngram_range parameter. While a 1-gram representation is used by default, we
could switch to a 2-gram representation by initializing a new CountVectorizer
instance with ngram_range=(2,2).
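The n-gram construction described in this note can be sketched in a few lines of
plain Python. The ngrams helper below is a hypothetical name used only for this
illustration; it slides a window of n tokens over an already tokenized document.

```python
def ngrams(tokens, n):
    """Return the contiguous n-token sequences in tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = 'the sun is shining'.split()
print(ngrams(tokens, 1))  # the 1-gram (unigram) representation
print(ngrams(tokens, 2))  # the 2-gram representation
```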
Assessing word relevancy via term frequency-inverse document frequency


When we are analyzing text data, we often encounter words that occur across
multiple documents from both classes. These frequently occurring words typically
don't contain useful or discriminatory information. In this subsection, we will learn
about a useful technique called term frequency-inverse document frequency
(tf-idf) that can be used to downweight these frequently occurring words in the
feature vectors. The tf-idf can be defined as the product of the term frequency and
the inverse document frequency:


$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t, d)$$


Here, tf(t, d) is the term frequency that we introduced in the previous section, and
idf(t, d) is the inverse document frequency, which can be calculated as follows:


$$\text{idf}(t, d) = \log \frac{n_d}{1 + \text{df}(d, t)}$$


Here, $n_d$ is the total number of documents, and df(d, t) is the number of documents d
that contain the term t. Note that adding the constant 1 to the denominator is optional
and serves the purpose of assigning a non-zero value to terms that occur in all
training samples; the log is used to ensure that low document frequencies are not
given too much weight.
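As a quick numerical illustration of the textbook definition, the following sketch
evaluates the inverse document frequency for a collection of three documents. The
function name textbook_idf is hypothetical, and the use of the natural logarithm is
an assumption, since the equation does not fix the base of the log.

```python
import math


def textbook_idf(n_docs, doc_freq):
    """idf(t, d) = log(n_d / (1 + df(d, t))), using the natural logarithm."""
    return math.log(n_docs / (1 + doc_freq))


n_docs = 3
for doc_freq in (1, 2, 3):
    # the more documents contain a term, the smaller its idf becomes
    print(doc_freq, textbook_idf(n_docs, doc_freq))
```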


The scikit-learn library implements yet another transformer, the TfidfTransformer
class, that takes the raw term frequencies from the CountVectorizer class as input
and transforms them into tf-idfs:


>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tfidf = TfidfTransformer(use_idf=True,
...                          norm='l2',
...                          smooth_idf=True)
>>> np.set_printoptions(precision=2)
>>> print(tfidf.fit_transform(count.fit_transform(docs))
...       .toarray())
[[ 0.    0.43  0.    0.56  0.56  0.    0.43  0.    0.  ]
 [ 0.    0.43  0.    0.    0.    0.56  0.43  0.    0.56]
 [ 0.5   0.45  0.5   0.19  0.19  0.19  0.3   0.25  0.19]]


As we saw in the previous subsection, the word 'is' had the largest term frequency
in the third document, being the most frequently occurring word. However, after
transforming the same feature vector into tf-idfs, we see that the word 'is' is now
associated with a relatively small tf-idf (0.45) in the third document, since it is also
present in the first and second document and thus is unlikely to contain any useful
discriminatory information.


However, if we'd manually calculated the tf-idfs of the individual terms in our
feature vectors, we'd notice that TfidfTransformer calculates the tf-idfs slightly
differently compared to the standard textbook equations that we defined previously.
The equation for the inverse document frequency implemented in scikit-learn is
as follows:


$$\text{idf}(t, d) = \log \frac{1 + n_d}{1 + \text{df}(d, t)}$$


Similarly, the tf-idf computed in scikit-learn deviates slightly from the default
equation we defined earlier:


$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times (\text{idf}(t, d) + 1)$$


While it is also more typical to normalize the raw term frequencies before
calculating the tf-idfs, the TfidfTransformer class normalizes the tf-idfs directly. By
default (norm='l2'), scikit-learn's TfidfTransformer applies the L2-normalization,
which returns a vector of length 1 by dividing an un-normalized feature vector v by
its L2-norm:


$$v_{norm} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}} = \frac{v}{\left(\sum_{i=1}^{n} v_i^2\right)^{1/2}}$$
To make sure that we understand how TfidfTransformer works, let's walk through
an example and calculate the tf-idf of the word 'is' in the third document.


The word 'is' has a term frequency of 3 (tf = 3) in the third document, and the
document frequency of this term is 3 since the term 'is' occurs in all three
documents (df = 3). Thus, we can calculate the inverse document frequency as
follows:


$$\text{idf}(\text{"is"}, d_3) = \log \frac{1 + 3}{1 + 3} = 0$$


Now, in order to calculate the tf-idf, we simply need to add 1 to the inverse
document frequency and multiply it by the term frequency:


$$\text{tf-idf}(\text{"is"}, d_3) = 3 \times (0 + 1) = 3$$


If we repeated this calculation for all terms in the third document, we'd obtain the
following tf-idf vector: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]. However,
notice that the values in this feature vector are different from the values that we
obtained from the TfidfTransformer that we used previously. The final step that we
are missing in this tf-idf calculation is the L2-normalization, which can be applied as
follows:


$$\text{tf-idf}(d_3)_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$= [0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$


$$\text{tf-idf}(\text{"is"}, d_3) = 0.45$$


As we can see, the results now match the results returned by scikit-learn's
TfidfTransformer, and since we now understand how tf-idfs are calculated, let's
proceed to the next section and apply those concepts to the movie review dataset.
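The manual calculation for the third document can be retraced with a short sketch
in plain Python. The term and document frequencies are taken from the example
above; the helper name smooth_idf is hypothetical, and the natural logarithm is
assumed, which is consistent with the rounded values quoted in the text.

```python
import math

n_docs = 3
# term -> (raw term frequency in the third document, document frequency)
doc3 = {'and': (2, 1), 'is': (3, 3), 'one': (2, 1), 'shining': (1, 2),
        'sun': (1, 2), 'sweet': (1, 2), 'the': (2, 3), 'two': (1, 1),
        'weather': (1, 2)}


def smooth_idf(n_docs, doc_freq):
    # idf(t, d) = log((1 + n_d) / (1 + df(d, t))), natural logarithm
    return math.log((1 + n_docs) / (1 + doc_freq))


# tf-idf(t, d) = tf(t, d) * (idf(t, d) + 1)
raw = [tf * (smooth_idf(n_docs, doc_freq) + 1)
       for tf, doc_freq in doc3.values()]

# L2-normalization: divide every entry by the Euclidean length of the vector
l2_norm = math.sqrt(sum(value ** 2 for value in raw))
normalized = [value / l2_norm for value in raw]

print([round(value, 2) for value in raw])
print([round(value, 2) for value in normalized])
```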
Cleaning text data


In the previous subsections, we learned about the bag-of-words model, term
frequencies, and tf-idfs. However, the first important step, before we build our
bag-of-words model, is to clean the text data by stripping it of all unwanted
characters. To illustrate why this is important, let's display the last 50
characters from the first document in the reshuffled movie review dataset:


>>> df.loc[0, 'review'][-50:]
'is seven.<br /><br />Title (Brazil): Not Available'


As we can see here, the text contains HTML markup as well as punctuation and 
other non-letter characters. While HTML markup does not contain much useful 
semantics, punctuation marks can represent useful, additional information in certain 
NLP contexts. However, for simplicity, we will now remove all punctuation marks 
except for emoticon characters such as :) since those are certainly useful for 
sentiment analysis. To accomplish this task, we will use Python's regular 
expression (regex) library, re, as shown here: 


>>> import re
>>> def preprocessor(text):
...     text = re.sub('<[^>]*>', '', text)
...     emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
...                            text)
...     text = (re.sub('[\W]+', ' ', text.lower()) +
...             ' '.join(emoticons).replace('-', ''))
...     return text


Via the first regex <[^>]*> in the preceding code section, we tried to remove all of
the HTML markup from the movie reviews. Although many programmers generally
advise against the use of regex to parse HTML, this regex should be sufficient to
clean this particular dataset. After we removed the HTML markup, we used a
slightly more complex regex to find emoticons, which we temporarily stored as
emoticons. Next, we removed all non-word characters from the text via the regex
[\W]+ and converted the text into lowercase characters.
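For readers who prefer not to strip markup with a regex, the following is a small
sketch of an alternative that relies on the HTMLParser class from Python's standard
library. The TagStripper class and the strip_tags function are hypothetical names
introduced for this illustration; this is not the approach used in the rest of the
chapter, and unlike the regex it also decodes character references such as &amp;.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text content and ignore every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text that is still buffered
    return ''.join(parser.chunks)


print(strip_tags('is seven.<br /><br />Title (Brazil): Not Available'))
```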


Note


In the context of this analysis, we assume that the capitalization of a word (for
example, whether it appears at the beginning of a sentence) does not contain
semantically relevant information. However, note that there are exceptions; for
instance, lowercasing also removes the capitalization that marks proper names. But
again, in the context of this analysis, it is a simplifying assumption that the letter
case does not contain information that is relevant for sentiment analysis.


Eventually, we added the temporarily stored emoticons to the end of the processed 
document string. Additionally, we removed the nose character (-) from the 
emoticons for consistency. 


Note


Although regular expressions offer an efficient and convenient approach to searching
for characters in a string, they also come with a steep learning curve. Unfortunately,
an in-depth discussion of regular expressions is beyond the scope of this book.
However, you can find a great tutorial on the Google Developers portal at
https://developers.google.com/edu/python/regular-expressions or check out the
official documentation of Python's re module at
https://docs.python.org/3.6/library/re.html.


Although the addition of the emoticon characters to the end of the cleaned document 
strings may not look like the most elegant approach, we shall note that the order of 
the words doesn't matter in our bag-of-words model if our vocabulary consists of 
only one-word tokens. But before we talk more about the splitting of documents into 
individual terms, words, or tokens, let's confirm that our preprocessor works 
correctly: 


>>> preprocessor(df.loc[0, 'review'][-50:])
'is seven title brazil not available'
>>> preprocessor("</a>This :) is :( a test :-)!")
'this is a test :) :( :)'


Lastly, since we will make use of the cleaned text data over and over again during
the next sections, let us now apply our preprocessor function to all the movie
reviews in our DataFrame:


>>> df['review'] = df['review'].apply(preprocessor)
Processing documents into tokens


After successfully preparing the movie review dataset, we now need to think about
how to split the text corpora into individual elements. One way to tokenize
documents is to split them into individual words by splitting the cleaned documents
at their whitespace characters:


>>> def tokenizer(text):
...     return text.split()
>>> tokenizer('runners like running and thus they run')
['runners', 'like', 'running', 'and', 'thus', 'they', 'run']


In the context of tokenization, another useful technique is word stemming, which is
the process of transforming a word into its root form. It allows us to map related
words to the same stem. The original stemming algorithm was developed by Martin
F. Porter in 1979 and is hence known as the Porter stemmer algorithm (An
algorithm for suffix stripping, Martin F. Porter, Program: Electronic Library and
Information Systems, 14(3): 130-137, 1980). The Natural Language Toolkit
(NLTK, http://www.nltk.org) for Python implements the Porter stemming algorithm,
which we will use in the following code section. In order to install the NLTK, you
can simply execute conda install nltk or pip install nltk.


Note


Although the NLTK is not the focus of the chapter, I highly recommend that you
visit the NLTK website as well as read the official NLTK book, which is freely
available at http://www.nltk.org/book/, if you are interested in more advanced
applications in NLP.


The following code shows how to use the Porter stemming algorithm: 


>>> from nltk.stem.porter import PorterStemmer
>>> porter = PorterStemmer()
>>> def tokenizer_porter(text):
...     return [porter.stem(word) for word in text.split()]
>>> tokenizer_porter('runners like running and thus they run')
['runner', 'like', 'run', 'and', 'thu', 'they', 'run']


Using the PorterStemmer from the nltk package, we modified our tokenizer 
function to reduce words to their root form, which was illustrated by the simple 
preceding example where the word 'running' was stemmed to its root form 'run'. 
Note


The Porter stemming algorithm is probably the oldest and simplest stemming
algorithm. Other popular stemming algorithms include the newer Snowball
stemmer (Porter2 or English stemmer) and the Lancaster stemmer (Paice/Husk
stemmer), which is faster but also more aggressive than the Porter stemmer. These
alternative stemming algorithms are also available through the NLTK package
(http://www.nltk.org/api/nltk.stem.html).


While stemming can create non-real words, such as 'thu' (from 'thus'), as shown
in the previous example, a technique called lemmatization aims to obtain the
canonical (grammatically correct) forms of individual words, the so-called lemmas.
However, lemmatization is computationally more difficult and expensive compared
to stemming and, in practice, it has been observed that stemming and lemmatization
have little impact on the performance of text classification (Influence of Word
Normalization on Text Classification, Michal Toman, Roman Tesar, and Karel
Jezek, Proceedings of InSciT, pages 354-358, 2006).


Before we jump into the next section, where we will train a machine learning model 
using the bag-of-words model, let's briefly talk about another useful topic called 
stop-word removal. Stop-words are simply those words that are extremely common 
in all sorts of texts and probably bear no (or only little) useful information that can 
be used to distinguish between different classes of documents. Examples of stop- 
words are is, and, has, and like. Removing stop-words can be useful if we are 
working with raw or normalized term frequencies rather than tf-idfs, which are 
already downweighting frequently occurring words. 
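The idea of stop-word removal can be sketched without any external library. The
STOP_WORDS set below is a deliberately tiny, hand-written list used only for
illustration (real stop-word lists are considerably longer), and remove_stop_words
is a hypothetical helper name.

```python
# A deliberately tiny stop-word list for illustration only
STOP_WORDS = {'a', 'and', 'has', 'is', 'like', 'the'}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word collection."""
    return [token for token in tokens if token not in stop_words]


tokens = 'the weather is sweet and the sun is shining'.split()
print(remove_stop_words(tokens))
```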


In order to remove stop-words from the movie reviews, we will use the set of 127
English stop-words that is available from the NLTK library, which can be obtained
by calling the nltk.download function:


>>> import nltk
>>> nltk.download('stopwords')


After we download the stop-words set, we can load and apply the English stop-word 
set as follows: 


>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> [w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
... if w not in stop]
['runner', 'like', 'run', 'run', 'lot']
Training a logistic regression model for document classification


In this section, we will train a logistic regression model to classify the movie reviews 
into positive and negative reviews. First, we will divide the DataFrame of cleaned 
text documents into 25,000 documents for training and 25,000 documents for testing: 


>>> X_train = df.loc[:25000, 'review'].values
>>> y_train = df.loc[:25000, 'sentiment'].values
>>> X_test = df.loc[25000:, 'review'].values
>>> y_test = df.loc[25000:, 'sentiment'].values


Next, we will use a GridSearchCV object to find the optimal set of parameters for our
logistic regression model using 5-fold stratified cross-validation:


>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(strip_accents=None,
...                         lowercase=False,
...                         preprocessor=None)
>>> param_grid = [{'vect__ngram_range': [(1,1)],
...                'vect__stop_words': [stop, None],
...                'vect__tokenizer': [tokenizer,
...                                    tokenizer_porter],
...                'clf__penalty': ['l1', 'l2'],
...                'clf__C': [1.0, 10.0, 100.0]},
...               {'vect__ngram_range': [(1,1)],
...                'vect__stop_words': [stop, None],
...                'vect__tokenizer': [tokenizer,
...                                    tokenizer_porter],
...                'vect__use_idf':[False],
...                'vect__norm':[None],
...                'clf__penalty': ['l1', 'l2'],
...                'clf__C': [1.0, 10.0, 100.0]}
...              ]
>>> lr_tfidf = Pipeline([('vect', tfidf),
...                      ('clf',
...                       LogisticRegression(random_state=0))])
>>> gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
...                            scoring='accuracy',
...                            cv=5, verbose=1,
...                            n_jobs=1)
>>> gs_lr_tfidf.fit(X_train, y_train)


Tip


Please note that it is highly recommended to set n_jobs=-1 (instead of n_jobs=1) in
the previous code example to utilize all available cores on your machine and speed
up the grid search. However, some Windows users reported issues when running the
previous code with the n_jobs=-1 setting, related to pickling the tokenizer and
tokenizer_porter functions for multiprocessing on Windows. A workaround
would be to replace those two functions, [tokenizer, tokenizer_porter], with
[str.split]. However, note that the replacement by the simple str.split would
not support stemming.


When we initialized the GridSearchCV object and its parameter grid using the
preceding code, we restricted ourselves to a limited number of parameter
combinations, since the number of feature vectors, as well as the large vocabulary,
can make the grid search computationally quite expensive. Using a standard desktop
computer, our grid search may take up to 40 minutes to complete.
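To get a feeling for why even this restricted grid is expensive, the number of
candidate settings can be counted directly from the parameter grid. The sketch below
mirrors the structure of the grid described in this section; the strings 'stop',
'tokenizer', and 'tokenizer_porter' are placeholders for the real objects, since
only the number of options per parameter matters for counting.

```python
from math import prod

# Placeholder strings stand in for the real stop-word list and tokenizers
param_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
]

n_folds = 5
# every dictionary contributes the product of its option counts
candidates = sum(prod(len(options) for options in grid.values())
                 for grid in param_grid)
print('candidates:', candidates)
print('model fits during cross-validation:', candidates * n_folds)
```

Each of these fits includes vectorizing the training folds, which is where most of
the time is likely spent; with its default settings, GridSearchCV also refits the best
candidate once more on the complete training set.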


In the previous code example, we replaced CountVectorizer and TfidfTransformer
from the previous subsection with TfidfVectorizer, which combines the functionality
of both objects. Our param_grid consisted of two parameter dictionaries. In the
first dictionary, we used the TfidfVectorizer with its default settings
(use_idf=True, smooth_idf=True, and norm='l2') to calculate the tf-idfs; in the
second dictionary, we set those parameters to use_idf=False, smooth_idf=False,
and norm=None in order to train a model based on raw term frequencies.
Furthermore, for the logistic regression classifier itself, we trained models using L2
and L1 regularization via the penalty parameter and compared different
regularization strengths by defining a range of values for the inverse-regularization
parameter C.


After the grid search has finished, we can print the best parameter set: 


>>> print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
Best parameter set: {'clf__C': 10.0, 'vect__stop_words': None,
'clf__penalty': 'l2', 'vect__tokenizer': <function tokenizer at
0x7f6c704948c8>, 'vect__ngram_range': (1, 1)}


As we can see in the preceding output, we obtained the best grid search results using
the regular tokenizer without Porter stemming, no stop-word library, and tf-idfs in
combination with a logistic regression classifier that uses L2-regularization with the
regularization strength C of 10.0.


Using the best model from this grid search, let's print the average 5-fold cross- 
validation accuracy scores on the training set and the classification accuracy on the 
test dataset: 


>>> print('CV Accuracy: %.3f'
...       % gs_lr_tfidf.best_score_)
CV Accuracy: 0.892
>>> clf = gs_lr_tfidf.best_estimator_
>>> print('Test Accuracy: %.3f'
...       % clf.score(X_test, y_test))
Test Accuracy: 0.899


The results reveal that our machine learning model can predict whether a movie
review is positive or negative with roughly 90 percent accuracy.


Note


A still very popular classifier for text classification is the Naive Bayes classifier,
which gained popularity in applications of email spam filtering. Naive Bayes
classifiers are easy to implement, computationally efficient, and tend to perform
particularly well on relatively small datasets compared to other algorithms. Although
we don't discuss Naive Bayes classifiers in this book, the interested reader can find
my article about Naive Bayes text classification that I made freely available on arXiv
(Naive Bayes and Text Classification I – Introduction and Theory, S. Raschka,
Computing Research Repository (CoRR), abs/1410.5329, 2014,
http://arxiv.org/pdf/1410.5329v3.pdf).
Working with bigger data – online algorithms and out-of-core learning


If you executed the code examples in the previous section, you may have noticed 
that it could be computationally quite expensive to construct the feature vectors for 
the 50,000 movie review dataset during grid search. In many real-world applications, 
it is not uncommon to work with even larger datasets that can exceed our computer's 
memory. Since not everyone has access to supercomputer facilities, we will now 
apply a technique called out-of-core learning, which allows us to work with such 
large datasets by fitting the classifier incrementally on smaller batches of the dataset. 


Back in Chapter 2, Training Simple Machine Learning Algorithms for Classification,
we introduced the concept of stochastic gradient descent, which is an optimization
algorithm that updates the model's weights using one sample at a time. In this
section, we will make use of the partial_fit function of the SGDClassifier in
scikit-learn to stream the documents directly from our local drive, and train a logistic
regression model using small mini-batches of documents.
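The idea of incremental fitting can be illustrated with a toy logistic regression in
plain Python whose weights are updated one sample at a time. The class
OnlineLogisticRegression is hypothetical and only borrows the method name
partial_fit to show the pattern; it is a sketch on dense two-dimensional toy data,
not a description of how SGDClassifier is implemented.

```python
import math


class OnlineLogisticRegression:
    """Toy logistic regression trained with per-sample gradient updates."""

    def __init__(self, n_features, eta=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.eta = eta  # learning rate

    def _proba(self, x):
        z = self.b + sum(w_j * x_j for w_j, x_j in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # update the weights after every single sample of the mini-batch
        for x, target in zip(X, y):
            error = target - self._proba(x)
            self.w = [w_j + self.eta * error * x_j
                      for w_j, x_j in zip(self.w, x)]
            self.b += self.eta * error
        return self

    def predict(self, X):
        return [1 if self._proba(x) >= 0.5 else 0 for x in X]


model = OnlineLogisticRegression(n_features=2)
# feed the same tiny mini-batch repeatedly, as if it arrived from a stream
for _ in range(20):
    model.partial_fit([[2.0, 0.0], [0.0, 2.0]], [1, 0])
print(model.predict([[2.0, 0.0], [0.0, 2.0]]))
```

Because the model state lives in the weights, each mini-batch can be discarded after
the call, which is what makes this pattern suitable for data that does not fit into
memory.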


First, we define a tokenizer function that cleans the unprocessed text data from the
movie_data.csv file that we constructed at the beginning of this chapter and
separates it into word tokens while removing stop words:


>>> import numpy as np
>>> import re
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> def tokenizer(text):
...     text = re.sub('<[^>]*>', '', text)
...     emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
...                            text.lower())
...     text = re.sub('[\W]+', ' ', text.lower()) \
...                   + ' '.join(emoticons).replace('-', '')
...     tokenized = [w for w in text.split() if w not in stop]
...     return tokenized


Next, we define a generator function stream_docs that reads in and returns one
document at a time:


>>> def stream_docs(path):
...     with open(path, 'r', encoding='utf-8') as csv:
...         next(csv) # skip header
...         for line in csv:
...             text, label = line[:-3], int(line[-2])
...             yield text, label


To verify that our stream_docs function works correctly, let's read in the first
document from the movie_data.csv file, which should return a tuple consisting of
the review text as well as the corresponding class label:


>>> next(stream_docs(path='movie_data.csv'))
('"In 1974, the teenager Martha Moxley ... ',1)
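The stream_docs function above relies on every line ending in a comma, a single-digit
label, and a newline, and it leaves the CSV quoting characters in the text, as the
leading quotation mark in the output shows. As a hedged alternative, the following
sketch lets Python's csv module do the parsing; stream_docs_csv is a hypothetical
name, and the sketch assumes a two-column file with a header line, as written in the
earlier section.

```python
import csv


def stream_docs_csv(path):
    """Yield (text, label) pairs, letting the csv module handle quoting."""
    with open(path, 'r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        next(reader)  # skip header
        for text, label in reader:
            yield text, int(label)
```

This variant returns the review text without surrounding quotes and also copes with
reviews that contain commas, quotation marks, or line breaks.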


We will now define a function, get_minibatch, that will take a document stream
from the stream_docs function and return a particular number of documents
specified by the size parameter:


>>> def get_minibatch(doc_stream, size):
...     docs, y = [], []
...     try:
...         for _ in range(size):
...             text, label = next(doc_stream)
...             docs.append(text)
...             y.append(label)
...     except StopIteration:
...         return None, None
...     return docs, y
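Note that get_minibatch returns None, None as soon as the stream runs out, so a final
batch with fewer than size documents would be discarded. This does not matter when
the number of documents is a multiple of the batch size, as it is here. For other
cases, the sketch below shows a variant based on itertools.islice that returns the
shorter final batch; get_minibatch_partial is a hypothetical name used for
illustration.

```python
from itertools import islice


def get_minibatch_partial(doc_stream, size):
    """Return up to `size` documents; (None, None) once the stream is empty."""
    batch = list(islice(doc_stream, size))
    if not batch:
        return None, None
    docs, y = zip(*batch)
    return list(docs), list(y)


stream = iter([('good', 1), ('bad', 0), ('fine', 1)])
print(get_minibatch_partial(stream, 2))  # a full batch of two documents
print(get_minibatch_partial(stream, 2))  # the shorter final batch
print(get_minibatch_partial(stream, 2))  # the stream is exhausted
```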


Unfortunately, we can't use CountVectorizer for out-of-core learning since it
requires holding the complete vocabulary in memory. Also, TfidfVectorizer needs
to keep all the feature vectors of the training dataset in memory to calculate the
inverse document frequencies. However, another useful vectorizer for text
processing implemented in scikit-learn is HashingVectorizer. HashingVectorizer
is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3
function by Austin Appleby (https://sites.google.com/site/murmurhash/):


>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> vect = HashingVectorizer(decode_error='ignore',
...                          n_features=2**21,
...                          preprocessor=None,
...                          tokenizer=tokenizer)
>>> clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
>>> doc_stream = stream_docs(path='movie_data.csv')
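

Why a hashing-based vectorizer needs no fitting can be seen in a few lines of plain
Python. This is a minimal sketch of the hashing trick, not scikit-learn's
implementation: zlib.crc32 is an assumed stand-in for MurmurHash3, the function name
hash_vectorize is made up, and details of the real class (such as normalization) are
ignored:

```python
import zlib


def hash_vectorize(tokens, n_features=2**4):
    """Map tokens to a fixed-length count vector without any vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        # zlib.crc32 is an assumed stand-in for MurmurHash3.
        idx = zlib.crc32(tok.encode('utf-8')) % n_features
        vec[idx] += 1
    return vec


vec_a = hash_vectorize(['good', 'movie', 'good'])
# Previously unseen words need no refit; they simply hash to some index.
vec_b = hash_vectorize(['unseen', 'words', 'need', 'no', 'refit'])
print(len(vec_a), sum(vec_a))
print(len(vec_b), sum(vec_b))
```

Both vectors have the same length regardless of the words they were built from. With
only 16 slots, different words can easily share an index (a hash collision), which is
the reason for choosing a large number of features in practice.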


Note


You can replace SGDClassifier(..., n_iter=1, ...) with SGDClassifier(...,
max_iter=1, ...) in scikit-learn versions greater than 0.18. The n_iter parameter is
used here deliberately, because scikit-learn 0.18 is still widely used.


Using the preceding code, we initialized HashingVectorizer with our tokenizer
function and set the number of features to 2**21. Furthermore, we reinitialized a
logistic regression classifier by setting the loss parameter of the SGDClassifier to
'log'. Note that by choosing a large number of features in the HashingVectorizer, we
reduce the chance of causing hash collisions, but we also increase the number of
coefficients in our logistic regression model. Now comes the really interesting part.
Having set up all the complementary functions, we can now start the out-of-core
learning using the following code:





>>> import pyprind
>>> pbar = pyprind.ProgBar(45)
>>> classes = np.array([0, 1])
>>> for _ in range(45):
...     X_train, y_train = get_minibatch(doc_stream, size=1000)
...     if not X_train:
...         break
...     X_train = vect.transform(X_train)
...     clf.partial_fit(X_train, y_train, classes=classes)
...     pbar.update()
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:39
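

The essence of this loop is that the model's weights persist between calls while only
one mini-batch is held in memory at a time. The following sketch shows that idea with
a hand-written logistic regression in plain Python. The class TinyLogReg, the
generator minibatches, and the toy data are all hypothetical; this is not how
SGDClassifier is implemented:

```python
import math
import random


class TinyLogReg:
    """Hypothetical minimal stand-in for an SGD-trained logistic regression."""

    def __init__(self, n_features, eta=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.eta = eta

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One gradient step per sample; the weights persist across calls.
        for x, target in zip(X, y):
            error = target - self.predict_proba(x)
            self.w = [wi + self.eta * error * xi
                      for wi, xi in zip(self.w, x)]
            self.b += self.eta * error
        return self


def minibatches(n_batches, size, rng):
    """Yield toy batches; the label is 1 when feature 0 exceeds feature 1."""
    for _ in range(n_batches):
        X = [[rng.random(), rng.random()] for _ in range(size)]
        y = [1 if a > b else 0 for a, b in X]
        yield X, y


rng = random.Random(1)
model = TinyLogReg(n_features=2)
for X_batch, y_batch in minibatches(n_batches=45, size=100, rng=rng):
    model.partial_fit(X_batch, y_batch)  # only one batch is in memory

X_test, y_test = next(minibatches(1, 500, rng))
accuracy = sum((model.predict_proba(x) > 0.5) == bool(t)
               for x, t in zip(X_test, y_test)) / len(y_test)
print('Accuracy: %.3f' % accuracy)
```

Because each batch is generated, used, and then dropped, the memory footprint stays
constant no matter how many batches are streamed through partial_fit.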


Again, we made use of the PyPrind package in order to estimate the progress of our 
learning algorithm. We initialized the progress bar object with 45 iterations and, in 
the following for loop, we iterated over 45 mini-batches of documents where each 
mini-batch consists of 1,000 documents. Having completed the incremental learning 
process, we will use the last 5,000 documents to evaluate the performance of our 
model: 


>>> X_test, y_test = get_minibatch(doc_stream, size=5000)
>>> X_test = vect.transform(X_test)
>>> print('Accuracy: %.3f' % clf.score(X_test, y_test))
Accuracy: 0.878


As we can see, the accuracy of the model is approximately 88 percent, slightly below
the accuracy that we achieved in the previous section using the grid search for
hyperparameter tuning. However, out-of-core learning is very memory efficient and
took less than a minute to complete. Finally, we can use the last 5,000 documents to
update our model:


>>> clf = clf.partial_fit(X_test, y_test)


If you are planning to continue directly with Chapter 9, Embedding a Machine
Learning Model into a Web Application, I recommend you keep the current Python
session open. In the next chapter, we will use the model that we just trained to learn
how to save it to disk for later use and embed it into a web application.


Note 


A more modern alternative to the bag-of-words model is word2vec, an algorithm
that Google released in 2013 (Efficient Estimation of Word Representations in
Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean, arXiv preprint
arXiv:1301.3781, 2013). The word2vec algorithm is an unsupervised learning
algorithm based on neural networks that attempts to automatically learn the
relationship between words. The idea behind word2vec is to put words that have
similar meanings into similar clusters, and via clever vector-spacing, the model can
reproduce certain words using simple vector math, for example, king - man +
woman = queen.


The original C-implementation with useful links to the relevant papers and
alternative implementations can be found at https://code.google.com/p/word2vec/.


Topic modeling with Latent Dirichlet Allocation


Topic modeling describes the broad task of assigning topics to unlabelled text
documents. For example, a typical application would be the categorization of
documents in a large text corpus of newspaper articles where we don't know on
which specific page or in which category they appear. In applications of topic
modeling, we then aim to assign category labels to those articles, for example,
sports, finance, world news, politics, local news, and so forth. Thus, in the context
of the broad categories of machine learning that we discussed in Chapter 1, Giving
Computers the Ability to Learn from Data, we can consider topic modeling as a
clustering task, a subcategory of unsupervised learning.


In this section, we will introduce a popular technique for topic modeling called 
Latent Dirichlet Allocation (LDA). However, note that while Latent Dirichlet 
Allocation is often abbreviated as LDA, it is not to be confused with Linear 
discriminant analysis, a supervised dimensionality reduction technique that we 
introduced in Chapter 5, Compressing Data via Dimensionality Reduction. 


Note 


LDA is different from the supervised learning approach that we took in this chapter
to classify movie reviews as positive and negative. Thus, if you are interested in
embedding scikit-learn models into a web application via the Flask framework using
the movie reviewer as an example, please feel free to jump to the next chapter and
revisit this standalone section on topic modeling later on.


Decomposing text documents with LDA


Since the mathematics behind LDA is quite involved and requires knowledge about
Bayesian inference, we will approach this topic from a practitioner's perspective and
interpret LDA using layman's terms. However, the interested reader can read more
about LDA in the following research paper: Latent Dirichlet Allocation, David M.
Blei, Andrew Y. Ng, and Michael I. Jordan, Journal of Machine Learning Research
3, pages: 993-1022, Jan 2003.


LDA is a generative probabilistic model that tries to find groups of words that appear
frequently together across different documents. These frequently appearing words
represent our topics, assuming that each document is a mixture of different words.
The input to an LDA is the bag-of-words model we discussed earlier in this chapter.
Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:


• A document to topic matrix
• A word to topic matrix


LDA decomposes the bag-of-words matrix in such a way that if we multiply those
two matrices together, we would be able to reproduce the input, the bag-of-words
matrix, with the lowest possible error. In practice, we are interested in those topics
that LDA found in the bag-of-words matrix. The only downside may be that we must
define the number of topics beforehand; the number of topics is a hyperparameter
of LDA that has to be specified manually.
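

To get a feel for the shapes involved in this decomposition, consider the following
toy sketch. All numbers are invented for illustration (3 documents, 2 topics, and 4
vocabulary words); nothing here is estimated by LDA, and the helper matmul is just a
plain-Python matrix product:

```python
# Hypothetical toy factors: 3 documents, 2 topics, 4 vocabulary words.
doc_topic = [[0.9, 0.1],
             [0.2, 0.8],
             [0.5, 0.5]]
topic_word = [[0.4, 0.4, 0.1, 0.1],
              [0.1, 0.1, 0.4, 0.4]]


def matmul(A, B):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


reconstruction = matmul(doc_topic, topic_word)
print(len(reconstruction), len(reconstruction[0]))  # documents x words
for row in reconstruction:
    print([round(v, 2) for v in row])
```

Multiplying the (documents x topics) matrix by the (topics x words) matrix yields a
(documents x words) matrix, that is, something with the same shape as the
bag-of-words input. The number of topics is the inner dimension that we have to
choose ourselves.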


LDA with scikit-learn


In this subsection, we will use the LatentDirichletAllocation class implemented 
in scikit-learn to decompose the movie review dataset and categorize it into different 
topics. In the following example, we restrict the analysis to 10 different topics, but 
readers are encouraged to experiment with the hyperparameters of the algorithm to 
explore the topics that can be found in this dataset further. 


First, we are going to load the dataset into a pandas DataFrame using the local
movie_data.csv file of the movie reviews that we have created at the beginning of
this chapter:


>>> import pandas as pd
>>> df = pd.read_csv('movie_data.csv', encoding='utf-8')


Next, we are going to use the already familiar CountVectorizer to create the
bag-of-words matrix as input to the LDA. For convenience, we will use scikit-learn's
built-in English stop word library via stop_words='english':


>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count = CountVectorizer(stop_words='english',
...                         max_df=.1,
...                         max_features=5000)
>>> X = count.fit_transform(df['review'].values)


Notice that we set the maximum document frequency of words to be considered to
10 percent (max_df=.1) to exclude words that occur too frequently across
documents. The rationale behind the removal of frequently occurring words is that
these might be common words appearing across all documents and are therefore less
likely associated with a specific topic category of a given document. Also, we
limited the number of words to be considered to the most frequently occurring 5,000
words (max_features=5000), to limit the dimensionality of this dataset so that it
improves the inference performed by LDA. However, both max_df=.1 and
max_features=5000 are hyperparameter values that I chose arbitrarily, and readers
are encouraged to tune them while comparing the results.
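

The effect of these two settings can be mimicked in plain Python. The sketch below is
an approximation under stated assumptions: the function select_vocabulary and the
four toy documents are made up, and scikit-learn's own tie-breaking and tokenization
are not reproduced:

```python
from collections import Counter

# Hypothetical, already tokenized toy corpus.
docs = [['great', 'movie', 'plot'],
        ['bad', 'movie', 'plot'],
        ['great', 'movie', 'cast'],
        ['movie', 'score']]


def select_vocabulary(docs, max_df, max_features):
    """Drop terms found in too many documents, then keep the most frequent."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_freq = Counter(term for doc in docs for term in doc)
    kept = [t for t in term_freq if doc_freq[t] / n_docs <= max_df]
    # Rank by total count (ties broken alphabetically) and truncate.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features])


vocab = select_vocabulary(docs, max_df=0.5, max_features=2)
print(vocab)  # ['great', 'plot']
```

The word movie occurs in every document and is therefore removed by the document
frequency threshold, whereas the limit on the number of features keeps only the two
most frequent of the remaining words.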


The following code example demonstrates how to fit a LatentDirichletAllocation
estimator to the bag-of-words matrix and infer the 10 different topics from the
documents (note that the model fitting can take up to five minutes or more on a
laptop or standard desktop computer):


>>> from sklearn.decomposition import LatentDirichletAllocation
>>> lda = LatentDirichletAllocation(n_topics=10,
...                                 random_state=123,
...                                 learning_method='batch')
>>> X_topics = lda.fit_transform(X)


By setting learning_method='batch', we let the lda estimator do its estimation
based on all available training data (the bag-of-words matrix) in one iteration, which
is slower than the alternative 'online' learning method but can lead to more
accurate results (setting learning_method='online' is analogous to online or
mini-batch learning that we discussed in Chapter 2, Training Simple Machine Learning
Algorithms for Classification, and in this chapter).


Note 


The scikit-learn library's implementation of LDA uses the Expectation-
Maximization (EM) algorithm to update its parameter estimates iteratively. We
haven't discussed the EM algorithm in this chapter, but if you are curious to learn
more, please see the excellent overview on Wikipedia
(https://en.wikipedia.org/wiki/Expectation–maximization_algorithm) and the
detailed explanation of how it is used in LDA in Colorado Reed's tutorial, Latent
Dirichlet Allocation: Towards a Deeper Understanding, which is freely available at
http://obphio.us/pdfs/lda_tutorial.pdf.


After fitting the LDA, we now have access to the components_ attribute of the lda
instance, which stores a matrix containing the word importance (here, 5000) for each
of the 10 topics in increasing order:


>>> lda.components_.shape
(10, 5000)


To analyze the results, let's print the five most important words for each of the 10
topics. Note that the word importance values are ranked in increasing order. Thus, to
print the top five words, we need to sort the topic array in reverse order:


>>> n_top_words = 5
>>> feature_names = count.get_feature_names()
>>> for topic_idx, topic in enumerate(lda.components_):
...     print("Topic %d:" % (topic_idx + 1))
...     print(" ".join([feature_names[i]
...                     for i in topic.argsort()\
...                         [:-n_top_words - 1:-1]]))
Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex girl woman
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read novel
Topic 10:
action fight guy guys cool
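

The slice used to pick the top words is compact but easy to misread, so here is a
small stand-alone illustration. The helper argsort imitates the ascending index order
of NumPy's argsort method for a plain list, and the word list and importance values
are hypothetical:

```python
def argsort(values):
    """Indices that sort `values` in ascending order (like ndarray.argsort)."""
    return sorted(range(len(values)), key=values.__getitem__)


feature_names = ['plot', 'horror', 'family', 'war', 'music', 'house']
topic = [0.2, 9.1, 0.4, 0.1, 0.3, 5.6]  # hypothetical word importances
n_top_words = 2

# Start at the last (largest) index and step backwards n_top_words times.
top_idx = argsort(topic)[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)  # ['horror', 'house']
```

Because the indices come back in ascending order of importance, walking the index
list from the end with a step of -1 returns the most important words first.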


Based on reading the five most important words for each topic, we may guess that
the LDA identified the following topics:


1. Generally bad movies (not really a topic category)
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies
7. Comedy movies
8. Movies somehow related to TV shows
9. Movies based on books
10. Action movies


To confirm that the categories make sense based on the reviews, let's print excerpts
of three reviews from the horror movie category (horror movies belong to category 6
at index position 5):


>>> horror = X_topics[:, 5].argsort()[::-1]
>>> for iter_idx, movie_idx in enumerate(horror[:3]):
...     print('\nHorror movie #%d:' % (iter_idx + 1))
...     print(df['review'][movie_idx][:300], '...')

Horror movie #1:
House of Dracula works from the same basic premise as House of
Frankenstein from the year before; namely that Universal's three most
famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are
appearing in the movie together. Naturally, the film is rather messy
therefore, but the fact that ...

Horror movie #2:
Okay, what the hell kind of TRASH have I been watching now? "The
Witches' Mountain" has got to be one of the most incoherent and insane
Spanish exploitation flicks ever and yet, at the same time, it's also
strangely compelling. There's absolutely nothing that makes sense here
and I even doubt there ...

Horror movie #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a
total freakfest from start to finish. A fun freakfest at that, but at
times it was a tad too reliant on kitsch rather than the horror. The
story is difficult to summarize succinctly: a carefree, normal teenage
girl starts coming fac ...


Using the preceding code example, we printed the first 300 characters from the top
three horror movies, and we can see that the reviews, even though we don't know
which exact movie they belong to, sound like reviews of horror movies (however,
one might argue that Horror movie #2 could also be a good fit for topic category 1:
Generally bad movies).


Summary


In this chapter, we learned how to use machine learning algorithms to classify text
documents based on their polarity, which is a basic task in sentiment analysis in the
field of NLP. Not only did we learn how to encode a document as a feature vector
using the bag-of-words model, but we also learned how to weight the term frequency
by relevance using tf-idf.


Working with text data can be computationally quite expensive due to the large 
feature vectors that are created during this process; in the last section, we learned 
how to utilize out-of-core or incremental learning to train a machine learning 
algorithm without loading the whole dataset into a computer's memory. 


Lastly, we introduced the concept of topic modeling using LDA to categorize the 
movie reviews into different categories in unsupervised fashion. 


In the next chapter, we will use our document classifier and learn how to embed it
into a web application.


Chapter 9. Embedding a Machine Learning Model into a Web Application


In the previous chapters, you learned about the many different machine learning 
concepts and algorithms that can help us with better and more efficient decision- 
making. However, machine learning techniques are not limited to offline 
applications and analysis, and they can be the predictive engine of your web 
services. For example, popular and useful applications of machine learning models 
in web applications include spam detection in submission forms, search engines, 
recommendation systems for media or shopping portals, and many more. 


In this chapter, you will learn how to embed a machine learning model into a web 
application that can not only classify, but also learn from data in real time. The 
topics that we will cover are as follows: 


• Saving the current state of a trained machine learning model
• Using SQLite databases for data storage
• Developing a web application using the popular Flask web framework
• Deploying a machine learning application to a public web server


Serializing fitted scikit-learn estimators


Training a machine learning model can be computationally quite expensive, as we
have seen in Chapter 8, Applying Machine Learning to Sentiment Analysis. Surely
we don't want to train our model every time we close our Python interpreter and
want to make a new prediction or reload our web application? One option for model
persistence is Python's built-in pickle module
(https://docs.python.org/3.6/library/pickle.html), which allows us to serialize and
deserialize Python object structures to compact bytecode so that we can save our
classifier in its current state and reload it if we want to classify new samples, without
needing the model to learn from the training data all over again. Before you execute
the following code, please make sure that you have trained the out-of-core logistic
regression model from the last section of Chapter 8, Applying Machine Learning to
Sentiment Analysis and have it ready in your current Python session:


>>> import pickle
>>> import os
>>> dest = os.path.join('movieclassifier', 'pkl_objects')
>>> if not os.path.exists(dest):
...     os.makedirs(dest)
>>> pickle.dump(stop,
...             open(os.path.join(dest, 'stopwords.pkl'), 'wb'),
...             protocol=4)
>>> pickle.dump(clf,
...             open(os.path.join(dest, 'classifier.pkl'), 'wb'),
...             protocol=4)


Using the preceding code, we created a movieclassifier directory where we will
later store the files and data for our web application. Within this movieclassifier
directory, we created a pkl_objects subdirectory to save the serialized Python
objects to our local drive. Via the dump method of the pickle module, we then
serialized the trained logistic regression model as well as the stop word set from the
Natural Language Toolkit (NLTK) library, so that we don't have to install the
NLTK vocabulary on our server.


The dump method takes as its first argument the object that we want to pickle, and for
the second argument we provided an open file object that the Python object will be
written to. Via the wb argument inside the open function, we opened the file in binary
mode for pickle, and we set protocol=4 to choose the latest and most efficient pickle
protocol at the time of writing, which was added in Python 3.4 and is compatible
with Python 3.4 or newer. If you have problems using protocol=4, please check
whether you are using the latest Python 3 version. Alternatively, you may consider
choosing a lower protocol number.
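

As a quick sanity check of the dump and load mechanics, the following sketch performs
a full round trip in a temporary directory. The objects stop and model_state are
hypothetical stand-ins for the stop word list and the trained classifier, and the
with statement is used here as one possible way to make sure the file handles are
closed again:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in objects; any picklable object works the same way.
stop = ['a', 'the', 'and']
model_state = {'weights': [0.5, -1.2, 3.0], 'bias': 0.1}

with tempfile.TemporaryDirectory() as tmp:
    dest = os.path.join(tmp, 'pkl_objects')
    os.makedirs(dest, exist_ok=True)

    # The with statement closes each file handle once dump() has returned.
    with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
        pickle.dump(stop, f, protocol=4)
    with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
        pickle.dump(model_state, f, protocol=4)

    with open(os.path.join(dest, 'classifier.pkl'), 'rb') as f:
        restored = pickle.load(f)

print(restored == model_state)
```

The restored object is an equal but independent copy of the original, which is what
allows a new Python session to pick up the classifier where the old session left off.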


Note 


Our logistic regression model contains several NumPy arrays, such as the weight
vector, and a more efficient way to serialize NumPy arrays is to use the alternative
joblib library. To ensure compatibility with the server environment that we will use
in later sections, we will use the standard pickle approach. If you are interested, you
can find more information about joblib at http://pythonhosted.org/joblib/.


We don't need to pickle HashingVectorizer, since it does not need to be fitted.
Instead, we can create a new Python script file from which we can import the
vectorizer into our current Python session. Now, copy the following code and save it
as vectorizer.py in the movieclassifier directory:


from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle

cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(
                os.path.join(cur_dir,
                'pkl_objects',
                'stopwords.pkl'), 'rb'))

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) \
           + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)


After we have pickled the Python objects and created the vectorizer.py file, it
would now be a good idea to restart our Python interpreter or IPython Notebook
kernel to test if we can deserialize the objects without error.


Note 


However, please note that unpickling data from an untrusted source can be a 
potential security risk, since the pickle module is not secured against malicious 
code. Since pickle was designed to serialize arbitrary objects, the unpickling 
process will execute code that has been stored in a pickle file. Thus, if you receive 
pickle files from an untrusted source (for example, by downloading them from the 
internet), please proceed with extra care and unpickle the items in a virtual 
environment and/or on a non-essential machine that does not store important data 
that no one except you should have access to. 
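

That unpickling can run code is easy to demonstrate with a harmless example. In the
sketch below, the class name Innocent is made up, and the builtin sorted is chosen
only because it is safe; the point is that the pickle stream, not the reader, decides
which callable gets invoked:

```python
import pickle


class Innocent:
    """Looks like data, but tells pickle to call a function when loaded."""

    def __reduce__(self):
        # pickle stores "call sorted([3, 1, 2])"; a hostile file could name
        # any importable callable here instead of a harmless builtin.
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Innocent(), protocol=4)
loaded = pickle.loads(payload)
print(type(loaded).__name__, loaded)
```

What comes back is not an Innocent instance but the return value of the function
call that was recorded in the byte stream, which is why pickle files from unknown
sources deserve the caution described in the preceding note.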


From your Terminal, navigate to the movieclassifier directory, start a new Python
session and execute the following code to verify that you can import the vectorizer
and unpickle the classifier:


>>> import pickle
>>> import re
>>> import os
>>> from vectorizer import vect
>>> clf = pickle.load(open(
...                os.path.join('pkl_objects',
...                'classifier.pkl'), 'rb'))


After we have successfully loaded the vectorizer and unpickled the classifier, we
can now use these objects to preprocess document samples and make predictions
about their sentiment:


>>> import numpy as np
>>> label = {0:'negative', 1:'positive'}
>>> example = ['I love this movie']
>>> X = vect.transform(example)
>>> print('Prediction: %s\nProbability: %.2f%%' %\
...       (label[clf.predict(X)[0]],
...        np.max(clf.predict_proba(X))*100))
Prediction: positive
Probability: 91.56%


Since our classifier returns the class labels as integers, we defined a simple Python
dictionary to map these integers to their sentiment. We then used
HashingVectorizer to transform the simple example document into a word vector X.
Finally, we used the predict method of the logistic regression classifier to predict
the class label, as well as the predict_proba method to return the corresponding
probability of our prediction. Note that the predict_proba method call returns an
array with a probability value for each unique class label. Since the class label with
the largest probability corresponds to the class label that is returned by the predict
call, we used the np.max function to return the probability of the predicted class.
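

The relationship between the probability array and the predicted label can be spelled
out without a trained model. In this sketch, the nested list proba is a hypothetical
stand-in for what predict_proba returns for a single document, with one column per
class label:

```python
label = {0: 'negative', 1: 'positive'}

# Hypothetical predict_proba output: one row per document, one column per class.
proba = [[0.0844, 0.9156]]

row = proba[0]
# The predicted class is the column index holding the largest probability.
predicted_class = max(range(len(row)), key=row.__getitem__)
confidence = max(row) * 100

print('Prediction: %s\nProbability: %.2f%%'
      % (label[predicted_class], confidence))
```

Taking the maximum of the row therefore reports the probability of exactly the class
that was predicted.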


Setting up an SQLite database for data storage


In this section, we will set up a simple SQLite database to collect optional feedback
about the predictions from users of the web application. We can use this feedback to
update our classification model. SQLite is an open source SQL database engine that
doesn't require a separate server to operate, which makes it ideal for smaller projects
and simple web applications. Essentially, a SQLite database can be understood as a
single, self-contained database file that allows us to directly access storage files.


Furthermore, SQLite doesn't require any system-specific configuration and is
supported by all common operating systems. It has gained a reputation for being very
reliable as it is used by popular companies such as Google, Mozilla, Adobe, Apple,
Microsoft, and many more. If you want to learn more about SQLite, I recommend
you visit the official website at http://www.sqlite.org.


Fortunately, following Python's batteries included philosophy, there is already an
API in the Python standard library, sqlite3, which allows us to work with SQLite
databases (for more information about sqlite3, please visit
https://docs.python.org/3.6/library/sqlite3.html).


By executing the following code, we will create a new SQLite database inside the
movieclassifier directory and store two example movie reviews:


>>> import sqlite3
>>> import os
>>> if os.path.exists('reviews.sqlite'):
...     os.remove('reviews.sqlite')
>>> conn = sqlite3.connect('reviews.sqlite')
>>> c = conn.cursor()
>>> c.execute('CREATE TABLE review_db'\
...           ' (review TEXT, sentiment INTEGER, date TEXT)')
>>> example1 = 'I love this movie'
>>> c.execute("INSERT INTO review_db"\
...           " (review, sentiment, date) VALUES"\
...           " (?, ?, DATETIME('now'))", (example1, 1))
>>> example2 = 'I disliked this movie'
>>> c.execute("INSERT INTO review_db"\
...           " (review, sentiment, date) VALUES"\
...           " (?, ?, DATETIME('now'))", (example2, 0))
>>> conn.commit()
>>> conn.close()


Following the preceding code example, we created a connection (conn) to a SQLite
database file by calling the connect method of the sqlite3 library, which created
the new database file reviews.sqlite in the movieclassifier directory if it didn't
already exist. Please note that SQLite doesn't implement a replace function for
existing tables; this is why the code removes an existing reviews.sqlite file before
connecting, so that it can be executed a second time.


Next, we created a cursor via the cursor method, which allows us to traverse over
the database records using the versatile SQL syntax. Via the first execute call, we
then created a new database table, review_db. We used this to store and access
database entries. Along with review_db, we also created three columns in this
database table: review, sentiment, and date. We used these to store two example
movie reviews and respective class labels (sentiments).


Using the DATETIME('now') SQL command, we also added date and timestamps to
our entries. In addition to the timestamps, we used the question mark symbols (?) to
pass the movie review texts (example1 and example2) and the corresponding class
labels (1 and 0) as positional arguments to the execute method, as members of a
tuple. Lastly, we called the commit method to save the changes that we made to the
database and closed the connection via the close method.
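

If you would like to experiment with the placeholder mechanism without touching the
reviews.sqlite file, the same statements can be run against an in-memory database.
The sketch below assumes nothing beyond the standard library; the ':memory:'
connection string and the loop over a list of reviews are conveniences chosen for
this illustration:

```python
import sqlite3

# ':memory:' keeps the database in RAM, so no file is created or left behind.
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE review_db'
          ' (review TEXT, sentiment INTEGER, date TEXT)')

# Each ? is filled, in order, from the tuple passed as the second argument.
reviews = [('I love this movie', 1), ('I disliked this movie', 0)]
for text, sentiment in reviews:
    c.execute("INSERT INTO review_db (review, sentiment, date)"
              " VALUES (?, ?, DATETIME('now'))", (text, sentiment))
conn.commit()

c.execute("SELECT review, sentiment FROM review_db ORDER BY rowid")
rows = c.fetchall()
conn.close()
print(rows)
```

Passing values through placeholders rather than formatting them into the SQL string
also means that quotation marks inside a review text cannot break the statement.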


To check if the entries have been stored in the database table correctly, we will now
reopen the connection to the database and use the SQL SELECT command to fetch all
rows in the database table that have been committed between the beginning of the
year 2017 and today:


>>> conn = sqlite3.connect('reviews.sqlite')
>>> c = conn.cursor()
>>> c.execute("SELECT * FROM review_db WHERE date"\
...           " BETWEEN '2017-01-01 00:00:00' AND DATETIME('now')")
>>> results = c.fetchall()
>>> conn.close()
>>> print(results)
[('I love this movie', 1, '2017-04-24 00:14:38'),
 ('I disliked this movie', 0, '2017-04-24 00:14:38')]


Alternatively, we could also use the free Firefox browser plugin SQLite Manager
(available at https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/), which
offers a nice GUI for working with SQLite databases, as shown in the
following figure:


[Figure: SQLite Manager showing the review_db table of reviews.sqlite, with the
columns rowid, review, sentiment, and date and the two example reviews as rows]


Developing a web application with Flask


Having prepared the code for classifying movie reviews in the previous subsection, 
let's discuss the basics of the Flask web framework to develop our web application. 
After Armin Ronacher's initial release of Flask in 2010, the framework has gained 
huge popularity over the years, and examples of popular applications that make use 
of Flask include LinkedIn and Pinterest. Since Flask is written in Python, it provides 
us Python programmers with a convenient interface for embedding existing Python 
code, such as our movie classifier. 


Note 


Flask is also known as a microframework, which means that its core is kept lean
and simple but can be easily extended with other libraries. Although the learning
curve of the lightweight Flask API is not nearly as steep as those of other popular
Python web frameworks, such as Django, I encourage you to take a look at the
official Flask documentation at http://flask.pocoo.org/docs/0.12/ to learn more about
its functionality.


If the Flask library is not already installed in your current Python environment, you 
can simply install it via conda or pip from your Terminal (at the time of writing, the 
latest stable release was version 0.12.1): 


conda install flask
# or: pip install flask


Our first Flask web application


In this subsection, we will develop a very simple web application to become more
familiar with the Flask API before we implement our movie classifier. This first
application we are going to build consists of a simple web page with a form field that
lets us enter a name. After submitting the name to the web application, it will render
it on a new page. While this is a very simple example of a web application, it helps
with building intuition about how to store and pass variables and values between the
different parts of our code within the Flask framework.


First, we create a directory tree: 


1st_flask_app_1/
    app.py
    templates/
        first_app.html


The app.py file will contain the main code that will be executed by the Python
interpreter to run the Flask web application. The templates directory is the directory
in which Flask will look for static HTML files for rendering in the web browser.
Let's now take a look at the contents of app.py:


from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('first_app.html')

if __name__ == '__main__':
    app.run()

After looking at the previous code example, let's discuss the individual pieces step 
by step: 


1. We ran our application as a single module; thus we initialized a new Flask
   instance with the argument __name__ to let Flask know that it can find the
   HTML template folder (templates) in the same directory where it is located.
2. Next, we used the route decorator (@app.route('/')) to specify the URL that
   should trigger the execution of the index function.
3. Here, our index function simply rendered the first_app.html HTML file,
   which is located in the templates folder.
4. Lastly, we used the run function to only run the application on the server when
   this script is directly executed by the Python interpreter, which we ensured
   using the if statement with __name__ == '__main__'.


Now, let's take a look at the contents of the first_app.html file:


<!doctype html>
<html>
  <head>
    <title>First app</title>
  </head>
  <body>
    <div>Hi, this is my first Flask web app!</div>
  </body>
</html>


Note 


If you are not familiar with the HTML syntax yet, I recommend you visit 
https://developer.mozilla.org/en-US/docs/Web/HTML for useful tutorials for 
learning the basics of HTML. 


Here, we have simply filled an empty HTML template file with a <div> element (a 
block level element) that contains this sentence: Hi, this is my first Flask web 
app!. 


Conveniently, Flask allows us to run our applications locally, which is useful for 
developing and testing web applications before we deploy them on a public web 
server. Now, let's start our web application by executing the command from the 
Terminal inside the 1st_flask_app_1 directory:


python3 app.py 


We should see a line such as the following displayed in the Terminal: 


* Running on http://127.0.0.1:5000/ 


This line contains the address of our local server. We can enter this address in our
web browser to see the web application in action. If everything has executed
correctly, we should see a simple website with the content Hi, this is my first
Flask web app! as shown in the following figure:

[Figure: browser window at 127.0.0.1 displaying "Hi, this is my first Flask web app!"]


Form validation and rendering


In this subsection, we will extend our simple Flask web application with HTML 
form elements to learn how to collect data from a user using the WTForms library 
(https://wtforms.readthedocs.org/en/latest/), which can be installed via conda or pip: 


conda install wtforms 
# or pip install wtforms 


This web application will prompt a user to type in his or her name into a text field, as 
shown in the following screenshot: 


[Screenshot: browser window at 127.0.0.1 showing the prompt "What's your name?", a text field containing "Sebastian", and a Say Hello button]





After the submission button (Say Hello) has been clicked and the form is validated, a
new HTML page will be rendered to display the user's name: 


[Screenshot: browser window at 127.0.0.1:5000/hello displaying "Hello Sebastian"]


Setting up the directory structure


The new directory structure that we need to set up for this application looks like this: 


1st_flask_app_2/
    app.py
    static/
        style.css
    templates/
        _formhelpers.html
        first_app.html
        hello.html


The following are the contents of our modified app.py file:


from flask import Flask, render_template, request
from wtforms import Form, TextAreaField, validators

app = Flask(__name__)

class HelloForm(Form):
    sayhello = TextAreaField('', [validators.DataRequired()])

@app.route('/')
def index():
    form = HelloForm(request.form)
    return render_template('first_app.html', form=form)

@app.route('/hello', methods=['POST'])
def hello():
    form = HelloForm(request.form)
    if request.method == 'POST' and form.validate():
        name = request.form['sayhello']
        return render_template('hello.html', name=name)
    return render_template('first_app.html', form=form)

if __name__ == '__main__':
    app.run(debug=True)


Let's discuss what the previous code does step by step: 


1. Using wtforms, we extended the index function with a text field that we will
embed in our start page using the TextAreaField class, which, together with the
DataRequired validator, checks whether a user has provided input text or not.

2. Furthermore, we defined a new function, hello, which will render an HTML
page hello.html after validating the HTML form.

3. Here, we used the POST method to transport the form data to the server in the
message body. Finally, by setting the debug=True argument inside the app.run
method, we further activated Flask's debugger. This is a useful feature for
developing new web applications.


Implementing a macro using the Jinja2 templating engine 


Now, we will implement a generic macro in the _formhelpers.html file via the
Jinja2 templating engine, which we will later import in our first_app.html file to
render the text field:


{% macro render_field(field) %}
  <dt>{{ field.label }}
  <dd>{{ field(**kwargs)|safe }}
  {% if field.errors %}
    <ul class=errors>
    {% for error in field.errors %}
      <li>{{ error }}</li>
    {% endfor %}
    </ul>
  {% endif %}
  </dd>
  </dt>
{% endmacro %}


An in-depth discussion about the Jinja2 templating language is beyond the scope of
this book. However, you can find comprehensive documentation of the Jinja2
syntax at http://jinja.pocoo.org.


Adding style via CSS


Next, we set up a simple Cascading Style Sheet (CSS) file, style.css, to
demonstrate how the look and feel of HTML documents can be modified. We have
to save the following CSS file, which will simply double the font size of our HTML
body elements, in a subdirectory called static, which is the default directory where
Flask looks for static files such as CSS. The file content is as follows:


body {
    font-size: 2em;
}


The following are the contents of the modified first_app.html file that will now
render a text form where a user can enter a name:


<!doctype html>
<html>
  <head>
    <title>First app</title>
    <link rel="stylesheet" href="{{ url_for('static',
     filename='style.css') }}">
  </head>
  <body>
    {% from "_formhelpers.html" import render_field %}

    <div>What's your name?</div>
    <form method=post action="/hello">
      <dl>
        {{ render_field(form.sayhello) }}
      </dl>
      <input type=submit value='Say Hello' name='submit_btn'>
    </form>
  </body>
</html>


In the header section of first_app.html, we loaded the CSS file. It should now alter
the size of all text elements in the HTML body. In the HTML body section, we
imported the form macro from _formhelpers.html, and we rendered the sayhello
form that we specified in the app.py file. Furthermore, we added a button to the
same form element so that a user can submit the text field entry.


Creating the result page 


Lastly, we will create a hello.html file that will be rendered via the
return render_template('hello.html', name=name) line inside the hello
function, which we defined in the app.py script, to display the text that a user
submitted via the text field. The file content is as follows:


<!doctype html>
<html>
  <head>
    <title>First app</title>
    <link rel="stylesheet" href="{{ url_for('static',
     filename='style.css') }}">
  </head>
  <body>
    <div>Hello {{ name }}</div>
  </body>
</html>


Having set up our modified Flask web application, we can run it locally by executing
the following command from the application's main directory, and we can view the
result in our web browser at http://127.0.0.1:5000/:


python3 app.py 


Note 


If you are new to web development, some of these concepts may seem very
complicated at first sight. In that case, I encourage you to simply set up the
preceding files in a directory on your hard drive and examine them closely. You will
see that the Flask web framework is relatively straightforward and much simpler
than it might initially appear! Also, for more help, don't forget to consult the
excellent Flask documentation and examples at http://flask.pocoo.org/docs/0.12/.


Turning the movie review classifier into a web application


Now that we are somewhat familiar with the basics of Flask web development, let's
advance to the next step and turn our movie classifier into a web application.
In this section, we will develop a web application that will first prompt a user to
enter a movie review, as shown in the following screenshot:


[Screenshot: start page at raschkas.pythonanywhere.com/ showing the prompt "Please enter your movie review:", a text area containing "I love this movie!", and a Submit review button]





After the review has been submitted, the user will see a new page that shows the 
predicted class label and the probability of the prediction. Furthermore, the user will 
be able to provide feedback about this prediction by clicking on the Correct or 
Incorrect button, as shown in the following screenshot:


[Screenshot: results page at raschkas.pythonanywhere.com/results showing the submitted review "I love this movie!", the prediction "This movie review is positive (probability: 90.86%).", the Correct and Incorrect buttons, and a Submit another review button]





If a user clicks on either the Correct or Incorrect button, our classification model
will be updated with respect to the user's feedback. Furthermore, we will also store
the movie review text provided by the user as well as the suggested class label,
which can be inferred from the button click, in a SQLite database for future
reference. (Alternatively, a user could skip the update step and click the Submit
another review button to submit another review.)


The third page that the user will see after clicking on one of the feedback buttons is a 
simple thank you screen with a Submit another review button that redirects the user 
back to the start page. This is shown in the following screenshot: 


[Screenshot: thank you page at raschkas.pythonanywhere.com/thanks showing "Thank you for your feedback!" and a Submit another review button]





Note 


Before we take a closer look at the code implementation of this web application, I 
encourage you to take a look at the live demo that I uploaded at 
http://raschkas.pythonanywhere.com to get a better understanding of what we are 
trying to accomplish in this section.


Files and folders: looking at the directory tree


To start with the big picture, let's take a look at the directory tree that we are going to 
create for this movie classification application, which is shown here: 


app.py
pkl_objects/
    classifier.pkl
    stopwords.pkl
reviews.sqlite
static/
    style.css
templates/
    _formhelpers.html
    results.html
    reviewform.html
    thanks.html
vectorizer.py





In the previous section of this chapter, we already created the vectorizer.py file,
the SQLite database reviews.sqlite, and the pkl_objects subdirectory with the
pickled Python objects.


The app.py file in the main directory is the Python script that contains our Flask
code, and we will use the reviews.sqlite database file (which we created earlier in
this chapter) to store the movie reviews that are being submitted to our web
application. The templates subdirectory contains the HTML templates that will be
rendered by Flask and displayed in the browser, and the static subdirectory will
contain a simple CSS file to adjust the look of the rendered HTML code.


Note 


A separate directory containing the movie review classifier application with the code
discussed in this section is provided with the code examples for this book, which you
can either obtain directly from Packt or download from GitHub at
https://github.com/rasbt/python-machine-learning-book-2nd-edition/. The code in
this section can be found in the .../code/ch09/movieclassifier subdirectory.


Implementing the main application as app.py


Since the app.py file is rather long, we will conquer it in two steps. The first section
of app.py imports the Python modules and objects that we are going to need, as well
as the code to unpickle and set up our classification model:


from flask import Flask, render_template, request
from wtforms import Form, TextAreaField, validators
import pickle
import sqlite3
import os
import numpy as np

# import HashingVectorizer from local dir
from vectorizer import vect

app = Flask(__name__)

######## Preparing the Classifier
cur_dir = os.path.dirname(__file__)
clf = pickle.load(open(os.path.join(cur_dir,
                 'pkl_objects',
                 'classifier.pkl'), 'rb'))
db = os.path.join(cur_dir, 'reviews.sqlite')

def classify(document):
    label = {0: 'negative', 1: 'positive'}
    X = vect.transform([document])
    y = clf.predict(X)[0]
    proba = np.max(clf.predict_proba(X))
    return label[y], proba

def train(document, y):
    X = vect.transform([document])
    clf.partial_fit(X, [y])

def sqlite_entry(path, document, y):
    conn = sqlite3.connect(path)
    c = conn.cursor()
    c.execute("INSERT INTO review_db (review, sentiment, date)"\
    " VALUES (?, ?, DATETIME('now'))", (document, y))
    conn.commit()
    conn.close()


This first part of the app.py script should look very familiar to us by now. We
simply imported the HashingVectorizer and unpickled the logistic regression
classifier. Next, we defined a classify function to return the predicted class label as
well as the corresponding probability prediction of a given text document. The train
function can be used to update the classifier, given that a document and a class label
are provided.
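
To see how classify combines the predicted label with its probability without loading
the pickled model, the following sketch swaps in a stand-in classifier. StubClassifier
and its fixed probabilities are made up for illustration, the text is passed through
without vectorization, and classify takes the classifier as an extra argument here,
which the real app.py does not do:

```python
# Sketch with a stand-in classifier; StubClassifier and its fixed
# probabilities are made up and replace the pickled logistic regression model.
class StubClassifier:
    def predict_proba(self, X):
        # one row per document: [P(negative), P(positive)]
        return [[0.1, 0.9] for _ in X]

    def predict(self, X):
        # pick the index of the larger probability in each row
        return [row.index(max(row)) for row in self.predict_proba(X)]


def classify(document, clf):
    label = {0: 'negative', 1: 'positive'}
    X = [document]                          # the real app vectorizes the text here
    y = clf.predict(X)[0]                   # integer class label
    proba = max(clf.predict_proba(X)[0])    # probability of the predicted class
    return label[y], proba


if __name__ == '__main__':
    print(classify('I love this movie!', StubClassifier()))
```

Taking the maximum over the class probabilities yields the probability of whichever
class was predicted, which is the value that is later shown to the user as a percentage.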


Using the sqlite_entry function, we can store a submitted movie review in our
SQLite database along with its class label and timestamp for our personal records.
Note that the clf object will be reset to its original, pickled state if we restart the
web application. At the end of this chapter, you will learn how to use the data that
we collect in the SQLite database to update the classifier permanently.
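
The following sketch exercises the sqlite_entry function against a throwaway database
file so that the real reviews.sqlite file stays untouched. The demo function, the
temporary file name, and the table layout (review TEXT, sentiment INTEGER, date TEXT)
are assumptions made for this example:

```python
import os
import sqlite3
import tempfile


def sqlite_entry(path, document, y):
    conn = sqlite3.connect(path)
    c = conn.cursor()
    c.execute("INSERT INTO review_db (review, sentiment, date)"
              " VALUES (?, ?, DATETIME('now'))", (document, y))
    conn.commit()
    conn.close()


def demo():
    # throwaway database file; the table layout below is an assumption
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, 'demo_reviews.sqlite')
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE review_db'
                 ' (review TEXT, sentiment INTEGER, date TEXT)')
    conn.commit()
    conn.close()

    # store one review with the class label 1 (positive)
    sqlite_entry(path, 'I love this movie!', 1)

    # read the row back to confirm that it was written
    conn = sqlite3.connect(path)
    rows = conn.execute(
        'SELECT review, sentiment, date FROM review_db').fetchall()
    conn.close()
    return rows


if __name__ == '__main__':
    print(demo())
```

The question-mark placeholders let SQLite insert the review text and label as
parameters, so quotes inside a review do not break the SQL statement.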


The concepts in the second part of the app.py script should also look quite familiar 
to us: 


######## Flask
class ReviewForm(Form):
    moviereview = TextAreaField('',
                                [validators.DataRequired(),
                                 validators.length(min=15)])

@app.route('/')
def index():
    form = ReviewForm(request.form)
    return render_template('reviewform.html', form=form)

@app.route('/results', methods=['POST'])
def results():
    form = ReviewForm(request.form)
    if request.method == 'POST' and form.validate():
        review = request.form['moviereview']
        y, proba = classify(review)
        return render_template('results.html',
                                content=review,
                                prediction=y,
                                probability=round(proba*100, 2))
    return render_template('reviewform.html', form=form)

@app.route('/thanks', methods=['POST'])
def feedback():
    feedback = request.form['feedback_button']
    review = request.form['review']
    prediction = request.form['prediction']

    inv_label = {'negative': 0, 'positive': 1}
    y = inv_label[prediction]
    if feedback == 'Incorrect':
        y = int(not(y))
    train(review, y)
    sqlite_entry(db, review, y)
    return render_template('thanks.html')

if __name__ == '__main__':
    app.run(debug=True)


We defined a ReviewForm class that instantiates a TextAreaField, which will be
rendered in the reviewform.html template file (the landing page of our web
application). This, in turn, is rendered by the index function. With the
validators.length(min=15) parameter, we require the user to enter a review that
contains at least 15 characters. Inside the results function, we fetch the contents of
the submitted web form and pass it on to our classifier to predict the sentiment of the
movie review, which will then be displayed in the rendered results.html
template.
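
As a rough illustration of what the two validators on the review field check, the
following plain-Python stand-in mirrors their combined effect. The helper
is_valid_review is hypothetical and not part of WTForms, and it assumes that the
required-data check rejects empty or whitespace-only input:

```python
# Plain-Python stand-in for the two checks; is_valid_review is a
# hypothetical helper and not a WTForms function.
def is_valid_review(text, min_length=15):
    # DataRequired-style check: something other than whitespace was entered
    has_data = bool(text and text.strip())
    # length(min=15)-style check: at least min_length characters
    long_enough = len(text or '') >= min_length
    return has_data and long_enough


if __name__ == '__main__':
    for candidate in ['I love this movie!', 'Great!', '']:
        print(repr(candidate), is_valid_review(candidate))
```

In the web application itself, form.validate() performs these checks, and the results
function falls back to rendering the review form again when validation fails.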


The feedback function, which we implemented at the end of the preceding app.py code,
may look a little bit complicated at first glance. It essentially fetches the predicted
class label from the results.html template if a user clicked on the Correct or
Incorrect feedback button, and transforms the predicted sentiment back into an
integer class label that will be used to update the classifier via the train function,
which we implemented in the first section of the app.py script. Also, a new entry to
the SQLite database will be made via the sqlite_entry function if feedback was
provided, and eventually the thanks.html template will be rendered to thank the
user for the feedback.
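
The label handling inside the feedback function can be isolated in a few lines. The
helper resolve_label below is a hypothetical name introduced only for this sketch; it
reproduces the mapping and the flip that the feedback function applies before
calling train:

```python
def resolve_label(prediction, feedback):
    # resolve_label is a hypothetical helper that isolates the label logic
    inv_label = {'negative': 0, 'positive': 1}
    y = inv_label[prediction]      # predicted sentiment as an integer
    if feedback == 'Incorrect':
        y = int(not y)             # flip 0 to 1 or 1 to 0
    return y


if __name__ == '__main__':
    for prediction in ('negative', 'positive'):
        for button in ('Correct', 'Incorrect'):
            print(prediction, button, resolve_label(prediction, button))
```

If the user marks a prediction as incorrect, the opposite label is used as the
training target; otherwise, the predicted label is kept.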




Setting up the review form 


Next, let's take a look at the reviewform.html template, which constitutes the
starting page of our application:


<!doctype html>
<html>
  <head>
    <title>Movie Classification</title>
    <link rel="stylesheet"
     href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>

    <h2>Please enter your movie review:</h2>

    {% from "_formhelpers.html" import render_field %}

    <form method=post action="/results">
      <dl>
        {{ render_field(form.moviereview, cols='30', rows='10') }}
      </dl>
      <div>
        <input type=submit value='Submit review'
         name='submit_btn'>
      </div>
    </form>

  </body>
</html>


Here, we simply imported the same _formhelpers.html template that we defined in
the Form validation and rendering section earlier in this chapter. The render_field
function of this macro is used to render a TextAreaField where a user can provide a
movie review and submit it via the Submit review button displayed at the bottom of
the page. This TextAreaField is 30 columns wide and 10 rows tall, and would look
like this:

[Screenshot: browser window at 127.0.0.1 showing the heading "Please enter your movie review:", an empty text area, and a Submit review button]


Creating a results page template


Our next template, results.html, looks a little bit more interesting:


<!doctype html>
<html>
  <head>
    <title>Movie Classification</title>
    <link rel="stylesheet"
     href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>

    <h3>Your movie review:</h3>
    <div>{{ content }}</div>

    <h3>Prediction:</h3>
    <div>This movie review is <strong>{{ prediction }}</strong>
    (probability: {{ probability }}%).</div>

    <div id='button'>
      <form action="/thanks" method="post">
        <input type=submit value='Correct'
         name='feedback_button'>
        <input type=submit value='Incorrect'
         name='feedback_button'>
        <input type=hidden value='{{ prediction }}'
         name='prediction'>
        <input type=hidden value='{{ content }}' name='review'>
      </form>
    </div>

    <div id='button'>
      <form action="/">
        <input type=submit value='Submit another review'>
      </form>
    </div>

  </body>
</html>


First, we inserted the submitted review, as well as the results of the prediction, in the
corresponding fields {{ content }}, {{ prediction }}, and {{ probability }}.
You may notice that we used the {{ content }} and {{ prediction }} placeholder
variables a second time in the form that contains the Correct and Incorrect buttons.
This is a workaround to post those values back to the server to update the classifier
and store the review in case the user clicks on one of those two buttons.


Furthermore, we imported a CSS file (style.css) at the beginning of the
results.html file. The setup of this file is quite simple; it limits the width of the
contents of this web application to 600 pixels and moves the Incorrect and Correct
buttons labeled with the div ID button down by 20 pixels:


body {
    width: 600px;
}

#button {
    padding-top: 20px;
}


This CSS file is merely a placeholder, so please feel free to change the
look and feel of the web application to your liking.


The last HTML file we will implement for our web application is the thanks.html
template. As the name suggests, it simply provides a nice thank you message to the
user after providing feedback via the Correct or Incorrect button. Furthermore, we
will put a Submit another review button at the bottom of this page, which will
redirect the user to the starting page. The contents of the thanks.html file are as
follows:


<!doctype html>
<html>
  <head>
    <title>Movie Classification</title>
    <link rel="stylesheet"
     href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>

    <h3>Thank you for your feedback!</h3>

    <div id='button'>
      <form action="/">
        <input type=submit value='Submit another review'>
      </form>
    </div>

  </body>
</html>


Now, it would be a good idea to start the web application locally from our Terminal 
via the following command before we advance to the next subsection and deploy it 
on a public web server: 


python3 app.py 


After we have finished testing our application, we also shouldn't forget to remove the
debug=True argument in the app.run() command of our app.py script.


Deploying the web application to a public server


After we have tested the web application locally, we are now ready to deploy our
web application onto a public web server. For this tutorial, we will be using the
PythonAnywhere web hosting service, which specializes in the hosting of Python
web applications and makes deployment simple and hassle-free. Furthermore,
PythonAnywhere offers a beginner account option that lets us run a single web
application free of charge.


Creating a PythonAnywhere account


To create a new PythonAnywhere account, we visit the website at 
https://www.pythonanywhere.com/ and click on the Pricing & signup link that is 
located in the top-right corner. Next, we click on the Create a Beginner account 
button where we need to provide a username, password, and valid email address. 
After we have read and agreed to the terms and conditions, we should have a new 
account. 


Unfortunately, the free beginner account doesn't allow us to access the remote server 
via the SSH protocol from our Terminal. Thus, we need to use the PythonAnywhere 
web interface to manage our web application. But before we can upload our local 
application files to the server, we need to create a new web application for our 
PythonAnywhere account. After we click on the Dashboard button in the top-right 
corner, we have access to the control panel shown at the top of the page. Next, we 
click on the Web tab that is now visible at the top of the page. We proceed by 
clicking on the +Add a new web app button on the left, which lets us create a new
Python 3.5 Flask web application that we name movieclassifier.


Uploading the movie classifier application


After creating a new application for our PythonAnywhere account, we head over to 
the Files tab, to upload the files from our local movieclassifier directory using the 
PythonAnywhere web interface. After uploading the web application files that we 
created locally on our computer, we should have a movieclassifier directory in our 
PythonAnywhere account. It contains the same directories and files as our local 
movieclassifier directory has, as shown in the following screenshot: 


[Screenshot: PythonAnywhere Files tab for /home/raschkas/movieclassifier, listing the directories __pycache__/, pkl_objects/, static/, and templates/ and the files app.py, reviews.sqlite, and vectorizer.py]





Lastly, we head over to the Web tab one more time and click on the Reload
<username>.pythonanywhere.com button to propagate the changes and refresh our
web application. Our web application should now be up and running and
publicly available via <username>.pythonanywhere.com.


Note


Troubleshooting 


Unfortunately, web servers can be quite sensitive to the tiniest problems in our web
application. If you are experiencing problems with running the web application on
PythonAnywhere and are receiving error messages in your browser, you can check
the server and error logs, which can be accessed from the Web tab in your
PythonAnywhere account, to better diagnose the problem.


Updating the movie classifier


While our predictive model is updated on the fly whenever a user provides feedback
about the classification, the updates to the clf object will be reset if the web server
crashes or restarts. If we reload the web application, the clf object will be
reinitialized from the classifier.pkl pickle file. One option to apply the updates
permanently would be to pickle the clf object once again after each update.
However, this would become computationally very inefficient with a growing
number of users, and could corrupt the pickle file if users provide feedback
simultaneously.
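
For completeness, re-pickling an object after an update could be sketched as follows.
The helpers save_model_atomically and load_model are hypothetical names and not part
of our application. Writing to a temporary file and swapping it into place is one
possible way to avoid leaving a half-written pickle file behind; it does not prevent
one user's update from overwriting another's, and it does not address the
computational cost mentioned above:

```python
import os
import pickle
import tempfile


def save_model_atomically(model, target_path):
    # save_model_atomically is a hypothetical helper, not part of the app
    directory = os.path.dirname(os.path.abspath(target_path))
    # write the pickle to a temporary file in the same directory first
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(model, f, protocol=4)
        # swap the finished file into place in a single step
        os.replace(tmp_path, target_path)
    except BaseException:
        # clean up the temporary file if anything went wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_model(path):
    with open(path, 'rb') as f:
        return pickle.load(f)


if __name__ == '__main__':
    demo_dir = tempfile.mkdtemp()
    demo_path = os.path.join(demo_dir, 'classifier.pkl')
    # a plain dictionary stands in for the classifier object here
    save_model_atomically({'weights': [0.5, -1.0]}, demo_path)
    print(load_model(demo_path))
```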


An alternative solution is to update the predictive model from the feedback data that
is being collected in the SQLite database. One option would be to download the
SQLite database from the PythonAnywhere server, update the clf object locally on
our computer, and upload the new pickle file to PythonAnywhere. To update the
classifier locally on our computer, we create an update.py script file in the
movieclassifier directory with the following contents:


import pickle
import sqlite3
import numpy as np
import os

# import HashingVectorizer from local dir
from vectorizer import vect

def update_model(db_path, model, batch_size=10000):

    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute('SELECT * from review_db')

    results = c.fetchmany(batch_size)
    while results:
        data = np.array(results)
        X = data[:, 0]
        y = data[:, 1].astype(int)

        classes = np.array([0, 1])
        X_train = vect.transform(X)
        model.partial_fit(X_train, y, classes=classes)
        results = c.fetchmany(batch_size)

    conn.close()
    return model

cur_dir = os.path.dirname(__file__)

clf = pickle.load(open(os.path.join(cur_dir,
                  'pkl_objects',
                  'classifier.pkl'), 'rb'))
db = os.path.join(cur_dir, 'reviews.sqlite')

clf = update_model(db_path=db, model=clf, batch_size=10000)

# Uncomment the following lines if you are sure that
# you want to update your classifier.pkl file
# permanently.

# pickle.dump(clf, open(os.path.join(cur_dir,
#             'pkl_objects', 'classifier.pkl'), 'wb')
#             , protocol=4)


Note


A separate directory containing the movie review classifier application with the
update functionality discussed in this chapter comes with the code examples for this
book, which you can either obtain directly from Packt or download from GitHub at
https://github.com/rasbt/python-machine-learning-book-2nd-edition/. The code in
this section is located in the .../code/ch09/movieclassifier_with_update
subdirectory.


The update_model function will fetch entries from the SQLite database in batches of
10,000 entries at a time, unless the database contains fewer entries. Alternatively, we
could also fetch one entry at a time by using fetchone instead of fetchmany, which
would be computationally very inefficient. However, keep in mind that using the
alternative fetchall method could be a problem if we are working with large
datasets that exceed the computer or server's memory capacity.
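
The batching behavior of fetchmany can be observed on a small in-memory database.
In the following sketch, batch_sizes and make_demo_db are hypothetical helpers, and
the table layout and the rows are made up for the demonstration:

```python
import sqlite3


def batch_sizes(conn, batch_size):
    # batch_sizes is a hypothetical helper that only counts rows per batch
    c = conn.cursor()
    c.execute('SELECT * from review_db')
    sizes = []
    results = c.fetchmany(batch_size)
    while results:                 # an empty list ends the loop
        sizes.append(len(results))
        results = c.fetchmany(batch_size)
    return sizes


def make_demo_db(n_rows):
    # in-memory database with an assumed table layout and made-up rows
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE review_db'
                 ' (review TEXT, sentiment INTEGER, date TEXT)')
    conn.executemany(
        "INSERT INTO review_db VALUES (?, ?, DATETIME('now'))",
        [('review %d' % i, i % 2) for i in range(n_rows)])
    conn.commit()
    return conn


if __name__ == '__main__':
    demo_conn = make_demo_db(7)
    print(batch_sizes(demo_conn, 3))
    demo_conn.close()
```

Each call to fetchmany returns at most batch_size rows, and the last batch is simply
smaller when fewer rows remain, so only one batch has to be held in memory at a time.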


Now that we have created the update.py script, we could also upload it to the
movieclassifier directory on PythonAnywhere, and import the update_model
function in the main application script app.py to update the classifier from the
SQLite database every time we restart the web application. In order to do so, we just
need to add a line of code to import the update_model function from the update.py
script at the top of app.py:


# import update function from local dir
from update import update_model


We then need to call the update_model function in the main application body:


if __name__ == '__main__':
    clf = update_model(db_path=db,
                       model=clf,
                       batch_size=10000)


As discussed, the modification in the previous code snippet will update the classifier
on PythonAnywhere from the feedback stored in the SQLite database each time the
web application is started; the pickle file itself is only overwritten if the pickle.dump
lines in update.py are uncommented. However, in practice, we do not often have to
restart our web application, and it would make sense to validate the user feedback in
the SQLite database prior to the update to make sure the feedback is valuable
information for the classifier.


Summary


In this chapter, you learned about many useful and practical topics that extend our 
knowledge of machine learning theory. You learned how to serialize a model after 
training and how to load it for later use cases. Furthermore, we created a SQLite 
database for efficient data storage and created a web application that lets us make our 
movie classifier available to the outside world. 


So far in this book, we have discussed many machine learning
concepts, best practices, and supervised models for classification. In the next
chapter, we will take a look at another subcategory of supervised learning, regression
analysis, which lets us predict outcome variables on a continuous scale, in contrast to
the categorical class labels of the classification models that we have been working
with so far.


Chapter 10. Predicting Continuous Target Variables with Regression Analysis


Throughout the previous chapters, you learned a lot about the main concepts behind 
supervised learning and trained many different models for classification tasks to 
predict group memberships or categorical variables. In this chapter, we will dive into 
another subcategory of supervised learning: regression analysis. 


Regression models are used to predict target variables on a continuous scale, which 
makes them attractive for addressing many questions in science as well as 
applications in industry, such as understanding relationships between variables, 
evaluating trends, or making forecasts. One example would be predicting the sales of 
a company in future months. 


In this chapter, we will discuss the main concepts of regression models and cover the 
following topics: 


Exploring and visualizing datasets
Looking at different approaches to implement linear regression models
Training regression models that are robust to outliers
Evaluating regression models and diagnosing common problems
Fitting regression models to nonlinear data


Introducing linear regression


The goal of linear regression is to model the relationship between one or multiple
features and a continuous target variable. As discussed in Chapter 1, Giving
Computers the Ability to Learn from Data, regression analysis is a subcategory of
supervised machine learning. In contrast to classification, another subcategory of
supervised learning, regression analysis aims to predict outputs on a continuous
scale rather than categorical class labels.


In the following subsections, we will introduce the most basic type of linear
regression, simple linear regression, and relate it to the more general, multivariate
case (linear regression with multiple features).


Simple linear regression


The goal of simple (univariate) linear regression is to model the relationship
between a single feature (explanatory variable x) and a continuous valued response
(target variable y). The equation of a linear model with one explanatory variable is
defined as follows:


$$y = w_0 + w_1 x$$


Here, the weight $w_0$ represents the y-axis intercept and $w_1$ is the weight coefficient
of the explanatory variable. Our goal is to learn the weights of the linear equation to
describe the relationship between the explanatory variable and the target variable,
which can then be used to predict the responses of new explanatory variables that
were not part of the training dataset.
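
As a small preview of what learning the weights can look like, the following sketch
estimates the intercept and the slope with the closed-form ordinary least squares
solution, which is one common way to fit such a line. The function names and the toy
data are made up for illustration:

```python
def fit_simple_linear_regression(x, y):
    # closed-form least-squares estimates for y = w0 + w1 * x
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    w1 = cov_xy / var_x          # weight coefficient (slope)
    w0 = mean_y - w1 * mean_x    # y-axis intercept
    return w0, w1


def predict(x_new, w0, w1):
    # response of the linear model for a new explanatory value
    return w0 + w1 * x_new


if __name__ == '__main__':
    # made-up toy data that lie exactly on the line y = 1 + 2x
    w0, w1 = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
    print(w0, w1, predict(4, w0, w1))
```

Once the two weights are known, predicting the response for a new value of the
explanatory variable only requires evaluating the linear equation.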


Based on the linear equation that we defined previously, linear regression can be
understood as finding the best-fitting straight line through the sample points, as
shown in the following figure:


This best-fitting line is also called the regression line, and the vertical lines from the
regression line to the sample points are the so-called offsets or residuals, that is, the
errors of our prediction.
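
To make the notion of residuals concrete, the following sketch computes the vertical
offsets between a few sample points and a given line, together with their sum of
squares. The function names, the line, and the data points are made up for
illustration:

```python
def residuals(x, y, w0, w1):
    # vertical offsets between the observed targets and the line's predictions
    return [yi - (w0 + w1 * xi) for xi, yi in zip(x, y)]


def sum_squared_errors(x, y, w0, w1):
    # a single number summarizing how far the points lie from the line
    return sum(r ** 2 for r in residuals(x, y, w0, w1))


if __name__ == '__main__':
    # made-up points and the candidate line y = 0 + 2x
    x = [1, 2, 3]
    y = [2, 4, 7]
    print(residuals(x, y, 0, 2))
    print(sum_squared_errors(x, y, 0, 2))
```

A point that lies exactly on the line has a residual of zero, and the farther a point
lies above or below the line, the larger its contribution to the sum of squared errors.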
Multiple linear regression


The special case of linear regression with one explanatory variable that we 
introduced in the previous subsection is also called simple linear regression. Of 
course, we can also generalize the linear regression model to multiple explanatory 
variables; this process is called multiple linear regression: 


$$y = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{i=0}^{m} w_i x_i = w^T x$$

Here, $w_0$ is the y-axis intercept with $x_0 = 1$.
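
The vector notation can be made concrete with a short NumPy sketch. The weights and feature values below are made up purely for illustration; the only point is that prepending $x_0 = 1$ lets the intercept be handled by the same dot product as the other weights.

```python
import numpy as np

# Hypothetical weights: w[0] is the intercept w_0, w[1:] are w_1..w_m
w = np.array([0.5, 2.0, -1.0])

# One sample with m = 2 features
x = np.array([3.0, 4.0])

# Prepend x_0 = 1 so that the intercept becomes part of the dot product
x_aug = np.hstack(([1.0], x))

y_dot = np.dot(w, x_aug)                    # w^T x
y_sum = w[0] + w[1] * x[0] + w[2] * x[1]    # the written-out sum

print(y_dot, y_sum)
```

Both expressions give the same prediction; with a single feature, the same code reduces to the simple linear regression equation.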


The following figure shows how the two-dimensional, fitted hyperplane of a multiple 
linear regression model with two features could look: 


[Figure: fitted hyperplane of a multiple linear regression model with two features; legible axis labels: Feature 1, Target]

As we can see, visualizations of multiple linear regression fits in three-dimensional
scatter plots are already challenging to interpret when looking at static figures.
Since we have no good means of visualizing hyperplanes with more than two dimensions
in a scatterplot (multiple linear regression models fit to datasets with three or more
features), the examples and visualizations in this chapter will mainly focus on the
univariate case, using simple linear regression. However, simple and multiple linear
regression are based on the same concepts and the same evaluation techniques; the code
implementations that we will discuss in this chapter are also compatible with both
types of regression model.

Exploring the Housing dataset


Before we implement our first linear regression model, we will introduce a new
dataset, the Housing dataset, which contains information about houses in the suburbs
of Boston collected by D. Harrison and D.L. Rubinfeld in 1978. The Housing dataset
has been made freely available and is included in the code bundle of this book. The
dataset has been recently removed from the UCI Machine Learning Repository but is
available online at
https://raw.githubusercontent.com/rasbt/python-machine-learning-book-2nd-edition/master/code/ch10/housing.data.txt.
As with each new dataset, it is always helpful to explore the data through a simple
visualization, to get a better feeling of what we are working with.

Loading the Housing dataset into a data frame


In this section, we will load the Housing dataset using the pandas read_csv function,
which is fast and versatile, and a recommended tool for working with tabular data
stored in a plaintext format.


The features of the 506 samples in the Housing dataset are summarized here, taken
from the original source that was previously shared on
https://archive.ics.uci.edu/ml/datasets/Housing:

• CRIM: Per capita crime rate by town
• ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.
• INDUS: Proportion of non-retail business acres per town
• CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• NOX: Nitric oxide concentration (parts per 10 million)
• RM: Average number of rooms per dwelling
• AGE: Proportion of owner-occupied units built prior to 1940
• DIS: Weighted distances to five Boston employment centers
• RAD: Index of accessibility to radial highways
• TAX: Full-value property tax rate per $10,000
• PTRATIO: Pupil-teacher ratio by town
• B: 1000(Bk - 0.63)^2, where Bk is the proportion of [people of African
  American descent] by town
• LSTAT: Percentage of lower status of the population
• MEDV: Median value of owner-occupied homes in $1000s


For the rest of this chapter, we will regard the house prices (MEDV) as our target
variable, that is, the variable that we want to predict using one or more of the 13
explanatory variables. Before we explore this dataset further, let us load it from the
GitHub URL into a pandas DataFrame:


>>> import pandas as pd
>>> df = pd.read_csv('https://raw.githubusercontent.com/rasbt/'
...                  'python-machine-learning-book-2nd-edition'
...                  '/master/code/ch10/housing.data.txt',
...                  header=None,
...                  sep=r'\s+')
>>> df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS',
...               'NOX', 'RM', 'AGE', 'DIS', 'RAD',
...               'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
>>> df.head()


To confirm that the dataset was loaded successfully, we displayed the first five lines 
of the dataset, as shown in the following figure: 

      CRIM    ZN  INDUS  ...     DIS  ...    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31  ...  4.0900  ...  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07  ...  4.9671  ...  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07  ...  4.9671  ...  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18  ...  6.0622  ...  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18  ...  6.0622  ...  222.0     18.7  396.90   5.33  36.2





Note 


You can find a copy of the Housing dataset (and all other datasets used in this book)
in the code bundle of this book, which you can use if you are working offline or the
web link
https://raw.githubusercontent.com/rasbt/python-machine-learning-book-2nd-edition/master/code/ch10/housing.data.txt
is temporarily unavailable. For instance, to load the Housing dataset from a local
directory, you can replace these lines:


df = pd.read_csv('https://raw.githubusercontent.com/rasbt/'
                 'python-machine-learning-book-2nd-edition'
                 '/master/code/ch10/housing.data.txt',
                 header=None,
                 sep=r'\s+')


Replace them with the following line:


df = pd.read_csv('./housing.data.txt', header=None, sep=r'\s+')

Visualizing the important characteristics of a dataset


Exploratory Data Analysis (EDA) is an important and recommended first step 
prior to the training of a machine learning model. In the rest of this section, we will 
use some simple yet useful techniques from the graphical EDA toolbox that may 
help us to visually detect the presence of outliers, the distribution of the data, and the 
relationships between features. 


First, we will create a scatterplot matrix that allows us to visualize the pair-wise 
correlations between the different features in this dataset in one place. To plot the 
scatterplot matrix, we will use the pairplot function from the Seaborn library 


(http://stanford.edu/~mwaskom/software/seaborn/), which is a Python library for 
drawing statistical plots based on Matplotlib. 


You can install the seaborn package via conda install seaborn or pip install
seaborn. After the installation is complete, you can import the package and create
the scatterplot matrix as follows:


>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
>>> # `height` is the current name of the argument formerly called `size`
>>> sns.pairplot(df[cols], height=2.5)
>>> plt.tight_layout()
>>> plt.show()


As we can see in the following figure, the scatterplot matrix provides us with a
useful graphical summary of the relationships in a dataset:

Due to space constraints and in the interest of readability, we only plotted five
columns from the dataset: LSTAT, INDUS, NOX, RM, and MEDV. However, you are
encouraged to create a scatterplot matrix of the whole DataFrame to explore the
dataset further by choosing different column names in the previous sns.pairplot
call, or include all variables in the scatterplot matrix by omitting the column selector
(sns.pairplot(df)).

Using this scatterplot matrix, we can now quickly eyeball how the data is distributed
and whether it contains outliers. For example, we can see that there is a linear
relationship between RM and house prices, MEDV (the fifth column of the fourth row).
Furthermore, we can see in the histogram (the lower-right subplot in the scatter plot
matrix) that the MEDV variable seems to be normally distributed but contains several
outliers.


Note 


Note that in contrast to common belief, training a linear regression model does not 
require that the explanatory or target variables are normally distributed. The 
normality assumption is only a requirement for certain statistics and hypothesis tests 
that are beyond the scope of this book (Introduction to Linear Regression Analysis,
Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining,
Wiley, 2012, pages: 318-319).




Looking at relationships using a correlation matrix


In the previous section, we visualized the data distributions of the Housing dataset 
variables in the form of histograms and scatter plots. Next, we will create a 
correlation matrix to quantify and summarize linear relationships between variables. 
A correlation matrix is closely related to the covariance matrix that we have seen in 
the section about Principal Component Analysis (PCA) in Chapter 5, Compressing 
Data via Dimensionality Reduction. Intuitively, we can interpret the correlation 
matrix as a rescaled version of the covariance matrix. In fact, the correlation matrix 
is identical to a covariance matrix computed from standardized features. 


The correlation matrix is a square matrix that contains the Pearson product-moment
correlation coefficient (often abbreviated as Pearson's r), which measures
the linear dependence between pairs of features. The correlation coefficients are in
the range -1 to 1. Two features have a perfect positive correlation if $r = 1$, no
correlation if $r = 0$, and a perfect negative correlation if $r = -1$. As mentioned
previously, Pearson's correlation coefficient can simply be calculated as the
covariance between two features x and y (numerator) divided by the product of their
standard deviations (denominator):

$$r = \frac{\sum_{i=1}^{n}\left[\left(x^{(i)} - \mu_x\right)\left(y^{(i)} - \mu_y\right)\right]}{\sqrt{\sum_{i=1}^{n}\left(x^{(i)} - \mu_x\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(y^{(i)} - \mu_y\right)^2}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

Here, $\mu$ denotes the sample mean of the corresponding feature, $\sigma_{xy}$ is the
covariance between the features x and y, and $\sigma_x$ and $\sigma_y$ are the features' standard
deviations.
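
As a quick sanity check of this definition, the following sketch computes Pearson's r by hand and compares it with NumPy's `corrcoef`. The two toy arrays are assumptions chosen for illustration and are not taken from the Housing dataset.

```python
import numpy as np

# Hypothetical toy features
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance (numerator) and standard deviations (denominator);
# population formulas are used throughout, so the 1/n factors cancel
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov_xy / (x.std() * y.std())

# NumPy's built-in estimate for comparison
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

The two values agree up to floating-point error, because the normalization constant appears in both the numerator and the denominator.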


Note 


We can show that the covariance between a pair of standardized features is in fact
equal to their linear correlation coefficient. To show this, let us first standardize the
features x and y to obtain their z-scores, which we will denote as $x'$ and $y'$,
respectively:

$$x' = \frac{x - \mu_x}{\sigma_x}, \quad y' = \frac{y - \mu_y}{\sigma_y}$$

Remember that we compute the (population) covariance between two features as
follows:

$$\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n}\left(x^{(i)} - \mu_x\right)\left(y^{(i)} - \mu_y\right)$$

Since standardization centers a feature variable at mean zero, we can now calculate
the covariance between the scaled features as follows:

$$\sigma'_{xy} = \frac{1}{n}\sum_{i=1}^{n}\left(x'^{(i)} - 0\right)\left(y'^{(i)} - 0\right)$$

Through resubstitution, we then get the following result:

$$\sigma'_{xy} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x^{(i)} - \mu_x}{\sigma_x}\right)\left(\frac{y^{(i)} - \mu_y}{\sigma_y}\right) = \frac{1}{n \cdot \sigma_x \sigma_y}\sum_{i=1}^{n}\left(x^{(i)} - \mu_x\right)\left(y^{(i)} - \mu_y\right)$$

Finally, we can simplify this equation as follows:

$$\sigma'_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
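
The same identity can be checked numerically. The sketch below uses randomly generated data (an assumption for illustration only), standardizes each column with the population standard deviation, and compares the population covariance matrix of the standardized features with the correlation matrix of the original features.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical data: 100 samples, 3 features, two of them correlated
A = rng.randn(100, 3)
A[:, 1] += 0.8 * A[:, 0]

# Standardize each column (population standard deviation, ddof=0)
A_std = (A - A.mean(axis=0)) / A.std(axis=0)

# Population covariance matrix of the standardized features (divides by n)
cov_std = np.cov(A_std.T, bias=True)

# Correlation matrix of the original, unscaled features
corr = np.corrcoef(A.T)

print(np.allclose(cov_std, corr))
```

Note that `bias=True` makes `np.cov` divide by n rather than n - 1, which matches the population covariance used in the derivation.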


In the following code example, we will use NumPy's corrcoef function on the five
feature columns that we previously visualized in the scatterplot matrix, and we will
use Seaborn's heatmap function to plot the correlation matrix array as a heat map:


>>> import numpy as np
>>> cm = np.corrcoef(df[cols].values.T)
>>> sns.set(font_scale=1.5)
>>> hm = sns.heatmap(cm,
...                  cbar=True,
...                  annot=True,
...                  square=True,
...                  fmt='.2f',
...                  annot_kws={'size': 15},
...                  yticklabels=cols,
...                  xticklabels=cols)
>>> plt.show()


As we can see in the resulting figure, the correlation matrix provides us with another 
useful summary graphic that can help us to select features based on their respective 
linear correlations: 

To fit a linear regression model, we are interested in those features that have a high
correlation with our target variable MEDV. Looking at the previous correlation matrix,
we see that our target variable MEDV shows the strongest correlation with the LSTAT
variable (-0.74); however, as you might remember from inspecting the scatterplot
matrix, there is a clear nonlinear relationship between LSTAT and MEDV. On the other
hand, the correlation between RM and MEDV is also relatively high (0.70). Given the
linear relationship between these two variables that we observed in the scatterplot, RM
seems to be a good choice for an explanatory variable to introduce the concepts of a
simple linear regression model in the following section.

Implementing an ordinary least squares linear regression model


At the beginning of this chapter, we mentioned that linear regression can be 
understood as obtaining the best-fitting straight line through the sample points of our 
training data. However, we have neither defined the term best-fitting nor have we 
discussed the different techniques of fitting such a model. In the following 
subsections, we will fill in the missing pieces of this puzzle using the Ordinary 
Least Squares (OLS) method (sometimes also called linear least squares) to 
estimate the parameters of the linear regression line that minimizes the sum of the 
squared vertical distances (residuals or errors) to the sample points. 

Solving regression for regression parameters with gradient descent


Consider our implementation of the ADAptive LInear NEuron (Adaline) from
Chapter 2, Training Simple Machine Learning Algorithms for Classification; we
remember that the artificial neuron uses a linear activation function. Also, we
defined a cost function $J(\cdot)$, which we minimized to learn the weights via
optimization algorithms, such as Gradient Descent (GD) and Stochastic Gradient
Descent (SGD). This cost function in Adaline is the Sum of Squared Errors (SSE),
which is identical to the cost function that we use for OLS:

$$J(w) = \frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

Here, $\hat{y}$ is the predicted value $\hat{y} = w^T x$ (note that the term $\frac{1}{2}$ is just used for
convenience to derive the update rule of GD). Essentially, OLS regression can be
understood as Adaline without the unit step function so that we obtain continuous
target values instead of the class labels -1 and 1. To demonstrate this, let us take the
GD implementation of Adaline from Chapter 2, Training Simple Machine Learning
Algorithms for Classification and remove the unit step function to implement our
first linear regression model:




class LinearRegressionGD(object):

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of passes over the training set

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            # step in the opposite direction of the cost gradient
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)


Note 


If you need a refresher about how the weights are being updated—taking a step into 
the opposite direction of the gradient—please revisit the Adaptive linear neurons and 
the convergence of learning section in Chapter 2, Training Simple Machine Learning 
Algorithms for Classification. 


To see our LinearRegressionGD regressor in action, let's use the RM (number of
rooms) variable from the Housing dataset as the explanatory variable and train a
model that can predict MEDV (house prices). Furthermore, we will standardize the
variables for better convergence of the GD algorithm. The code is as follows:


>>> X = df[['RM']].values
>>> y = df['MEDV'].values
>>> from sklearn.preprocessing import StandardScaler
>>> sc_x = StandardScaler()
>>> sc_y = StandardScaler()
>>> X_std = sc_x.fit_transform(X)
>>> y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()
>>> lr = LinearRegressionGD()
>>> lr.fit(X_std, y_std)


Notice the workaround regarding y_std, using np.newaxis and flatten. Most
transformers in scikit-learn expect data to be stored in two-dimensional arrays. In the
previous code example, the use of np.newaxis in y[:, np.newaxis] added a new
dimension to the array. Then, after the StandardScaler returned the scaled variable,
we converted it back to the original one-dimensional array representation using the
flatten() method for our convenience.
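
The effect of this workaround on the array shapes can be seen in isolation. The short array below is a made-up stand-in for the target values; `reshape(-1, 1)` is shown only as an equivalent alternative and is not what the chapter's code uses.

```python
import numpy as np

# Hypothetical target values stored as a 1-D array, shape (5,)
y_demo = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

y_col = y_demo[:, np.newaxis]   # add a second axis -> shape (5, 1)
y_alt = y_demo.reshape(-1, 1)   # equivalent alternative
y_back = y_col.flatten()        # back to shape (5,)

print(y_demo.shape, y_col.shape, y_back.shape)
```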


We discussed in Chapter 2, Training Simple Machine Learning Algorithms for
Classification that it is always a good idea to plot the cost as a function of the
number of epochs (passes over the training dataset) when we are using optimization
algorithms, such as gradient descent, to check that the algorithm converged to a cost
minimum (here, a global cost minimum):


>>> sns.reset_orig() # resets matplotlib style
>>> plt.plot(range(1, lr.n_iter+1), lr.cost_)
>>> plt.ylabel('SSE')
>>> plt.xlabel('Epoch')
>>> plt.show()


As we can see in the following plot, the GD algorithm converged after the fifth 
epoch: 





Next, let's visualize how well the linear regression line fits the training data. To do 
so, we will define a simple helper function that will plot a scatterplot of the training 
samples and add the regression line: 

>>> def lin_regplot(X, y, model):
...     plt.scatter(X, y, c='steelblue', edgecolor='white', s=70)
...     plt.plot(X, model.predict(X), color='black', lw=2)
...     return None


Now, we will use this lin_regplot function to plot the number of rooms against
house price:


>>> lin_regplot(X_std, y_std, lr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000s [MEDV] (standardized)')
>>> plt.show()


As we can see in the following plot, the linear regression line reflects the general
trend that house prices tend to increase with the number of rooms:

[Figure: regression line plotted over the training data; x-axis: Average number of rooms [RM] (standardized)]


Although this observation makes intuitive sense, the data also tells us that the
number of rooms does not explain the house prices very well in many cases. Later in
this chapter, we will discuss how to quantify the performance of a regression model.
Interestingly, we also observe that several data points lined up at $y = 3$, which
suggests that the prices may have been clipped. In certain applications, it may also be
important to report the predicted outcome variables on their original scale. To scale
the predicted price outcome back onto the Price in $1000s axis, we can simply
apply the inverse_transform method of the StandardScaler:


>>> num_rooms_std = sc_x.transform(np.array([[5.0]]))
>>> price_std = lr.predict(num_rooms_std)
>>> # inverse_transform expects a 2-D array, so add an axis first
>>> price = sc_y.inverse_transform(price_std[:, np.newaxis])
>>> print("Price in $1000s: %.3f" % price[0, 0])
Price in $1000s: 10.840


In this code example, we used the previously trained linear regression model to
predict the price of a house with five rooms. According to our model, such a house is
worth $10,840.


On a side note, it is also worth mentioning that we technically don't have to update 
the weights of the intercept if we are working with standardized variables since the 
y-axis intercept is always 0 in those cases. We can quickly confirm this by printing 
the weights: 


>>> print('Slope: %.3f' % lr.w_[1])
Slope: 0.695
>>> print('Intercept: %.3f' % lr.w_[0])
Intercept: -0.000

Estimating the coefficient of a regression model via scikit-learn


In the previous section, we implemented a working model for regression analysis;
however, in a real-world application we may be interested in more efficient
implementations. For example, many of scikit-learn's estimators for regression make
use of the LIBLINEAR library, advanced optimization algorithms, and other code
optimizations that work better with unstandardized variables, which is sometimes
desirable for certain applications:


>>> from sklearn.linear_model import LinearRegression
>>> slr = LinearRegression()
>>> slr.fit(X, y)
>>> print('Slope: %.3f' % slr.coef_[0])
Slope: 9.102
>>> print('Intercept: %.3f' % slr.intercept_)
Intercept: -34.671


As we can see from executing this code, scikit-learn's LinearRegression model,
fitted with the unstandardized RM and MEDV variables, yielded different model
coefficients. Let's compare it to our GD implementation by plotting MEDV against RM:


>>> lin_regplot(X, y, slr)
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000s [MEDV]')
>>> plt.show()


Now, when we plot the training data and our fitted model by executing this code, we
can see that the overall result looks identical to our GD implementation:

[Figure: regression line plotted over the training data; x-axis: Average number of rooms [RM]]


Note 


As an alternative to using machine learning libraries, there is also a closed-form
solution for solving OLS involving a system of linear equations that can be found in
most introductory statistics textbooks:

$$w = \left(X^T X\right)^{-1} X^T y$$

We can implement it in Python as follows:


>>> # adding a column vector of "ones"
>>> Xb = np.hstack((np.ones((X.shape[0], 1)), X))
>>> w = np.zeros(X.shape[1])
>>> z = np.linalg.inv(np.dot(Xb.T, Xb))
>>> w = np.dot(z, np.dot(Xb.T, y))
>>> print('Slope: %.3f' % w[1])
Slope: 9.102
>>> print('Intercept: %.3f' % w[0])
Intercept: -34.671

The advantage of this method is that it is guaranteed to find the optimal solution
analytically. However, if we are working with very large datasets, it can be
computationally too expensive to invert the matrix in this formula (sometimes also
called the normal equation) or the sample matrix may be singular (non-invertible),
which is why we may prefer iterative methods in certain cases.
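
One common way to sidestep the explicit inverse is to hand the least-squares problem to a dedicated solver. The following sketch uses NumPy's `np.linalg.lstsq`, which also copes with rank-deficient matrices. The synthetic data and the `_toy` variable names are assumptions made for illustration, and this is not the approach used elsewhere in this chapter.

```python
import numpy as np

rng = np.random.RandomState(1)

# Hypothetical data generated from y = -3 + 2x plus a little noise
X_toy = rng.uniform(0, 10, size=(50, 1))
y_toy = -3.0 + 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=50)

# Add the column of ones for the intercept
Xb_toy = np.hstack((np.ones((X_toy.shape[0], 1)), X_toy))

# Normal equation with an explicit matrix inverse
w_inv = np.linalg.inv(Xb_toy.T.dot(Xb_toy)).dot(Xb_toy.T).dot(y_toy)

# Least-squares solver; no explicit inverse is formed
w_lstsq, residuals, rank, sv = np.linalg.lstsq(Xb_toy, y_toy, rcond=None)

print(w_inv, w_lstsq)
```

On a well-conditioned problem such as this one, both routes return essentially the same weights; the solver-based route is the more defensive choice when the matrix may be close to singular.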


If you are interested in more information on how to obtain normal equations, I 
recommend you take a look at Dr. Stephen Pollock's chapter The Classical Linear 
Regression Model from his lectures at the University of Leicester, which is available
for free at: 

Fitting a robust regression model using RANSAC


Linear regression models can be heavily impacted by the presence of outliers. In 
certain situations, a very small subset of our data can have a big effect on the 
estimated model coefficients. There are many statistical tests that can be used to 
detect outliers, which are beyond the scope of the book. However, removing outliers 
always requires our own judgment as data scientists as well as our domain 
knowledge. 


As an alternative to throwing out outliers, we will look at a robust method of 
regression using the RANdom SAmple Consensus (RANSAC) algorithm, which 
fits a regression model to a subset of the data, the so-called inliers. 


We can summarize the iterative RANSAC algorithm as follows: 


1. Select a random number of samples to be inliers and fit the model.
2. Test all other data points against the fitted model and add those points that fall
   within a user-given tolerance to the inliers.
3. Refit the model using all inliers.
4. Estimate the error of the fitted model versus the inliers.
5. Terminate the algorithm if the performance meets a certain user-defined
   threshold or if a fixed number of iterations were reached; go back to step 1
   otherwise.
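
The five steps above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration under stated assumptions: the function name `ransac_line`, its parameters and defaults, and the synthetic data are all hypothetical, and scikit-learn's `RANSACRegressor` differs in details such as how candidate models are scored.

```python
import numpy as np

def ransac_line(X, y, n_init=10, tol=1.0, max_trials=100,
                stop_error=0.5, seed=0):
    """Toy RANSAC for a straight line; all defaults are illustrative."""
    rng = np.random.RandomState(seed)
    Xb = np.hstack((np.ones((X.shape[0], 1)), X))
    best_w, best_mask, best_err = None, None, np.inf
    for _ in range(max_trials):
        # Step 1: pick a random subset as initial inliers and fit the model
        idx = rng.choice(X.shape[0], size=n_init, replace=False)
        w = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)[0]
        # Step 2: points within the tolerance join the inlier set
        mask = np.abs(y - Xb.dot(w)) <= tol
        mask[idx] = True
        # Step 3: refit the model on all inliers
        w = np.linalg.lstsq(Xb[mask], y[mask], rcond=None)[0]
        # Step 4: mean absolute error of the refitted model on the inliers
        err = np.mean(np.abs(y[mask] - Xb[mask].dot(w)))
        if err < best_err:
            best_w, best_mask, best_err = w, mask, err
        # Step 5: stop early once the error is small enough
        if err <= stop_error:
            break
    return best_w, best_mask

# Synthetic data: y = 1 + 2x with small noise and ten gross outliers
rng = np.random.RandomState(42)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + rng.normal(scale=0.2, size=100)
y_demo[:10] += 30.0

w_ransac, inliers = ransac_line(X_demo, y_demo)
print(w_ransac, inliers.sum())
```

In this sketch, a trial whose random subset contains an outlier produces a large inlier error and is discarded, whereas a clean subset yields a line close to the underlying one and ends the loop early.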




Let us now wrap our linear model in the RANSAC algorithm using scikit-learn's 
RANSACRegressor class: 


>>> from sklearn.linear_model import RANSACRegressor
>>> # 'absolute_error' was named 'absolute_loss' in older scikit-learn releases
>>> ransac = RANSACRegressor(LinearRegression(),
...                          max_trials=100,
...                          min_samples=50,
...                          loss='absolute_error',
...                          residual_threshold=5.0,
...                          random_state=0)
>>> ransac.fit(X, y)


We set the maximum number of iterations of the RANSACRegressor to 100, and using
min_samples=50, we set the minimum number of the randomly chosen samples to be
at least 50. Using 'absolute_error' as an argument for the loss
parameter, the algorithm computes absolute vertical distances between the fitted line
and the sample points. By setting the residual_threshold parameter to 5.0, we
only allowed samples to be included in the inlier set if their vertical distance to the
fitted line is within 5 distance units, which works well on this particular dataset.


By default, scikit-learn uses the MAD estimate to select the inlier threshold, where
MAD stands for the Median Absolute Deviation of the target values y. However,
the choice of an appropriate value for the inlier threshold is problem-specific, which
is one disadvantage of RANSAC. Many different approaches have been developed in
recent years to select a good inlier threshold automatically. You can find a detailed
discussion in: Automatic Estimation of the Inlier Threshold in Robust Multiple
Structures Fitting, R. Toldo and A. Fusiello, Springer, 2009 (in Image Analysis and
Processing – ICIAP 2009, pages: 123-131).
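
For reference, the MAD of a set of target values is straightforward to compute by hand. The toy values below are assumptions chosen so that one entry is clearly extreme; the sketch only illustrates the definition and does not reproduce scikit-learn's internal code.

```python
import numpy as np

# Hypothetical target values with one extreme entry
y_mad_demo = np.array([21.0, 22.5, 19.8, 24.1, 20.3, 50.0])

# Median absolute deviation: median of |y - median(y)|
mad = np.median(np.abs(y_mad_demo - np.median(y_mad_demo)))

# The standard deviation is shown for comparison
print(mad, y_mad_demo.std())
```

Because it is built from medians, the MAD is barely affected by the single extreme value, whereas the standard deviation is inflated by it.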


After we fit the RANSAC model, let's obtain the inliers and outliers from the fitted 
RANSAC-linear regression model and plot them together with the linear fit: 


>>> inlier_mask = ransac.inlier_mask_
>>> outlier_mask = np.logical_not(inlier_mask)
>>> line_X = np.arange(3, 10, 1)
>>> line_y_ransac = ransac.predict(line_X[:, np.newaxis])
>>> plt.scatter(X[inlier_mask], y[inlier_mask],
...             c='steelblue', edgecolor='white',
...             marker='o', label='Inliers')
>>> plt.scatter(X[outlier_mask], y[outlier_mask],
...             c='limegreen', edgecolor='white',
...             marker='s', label='Outliers')
>>> plt.plot(line_X, line_y_ransac, color='black', lw=2)
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000s [MEDV]')
>>> plt.legend(loc='upper left')
>>> plt.show()


As we can see in the following scatterplot, the linear regression model was fitted on 
the detected set of inliers, shown as circles: 

When we print the slope and intercept of the model by executing the following code,
we can see that the linear regression line is slightly different from the fit that we
obtained in the previous section without using RANSAC:


>>> print('Slope: %.3f' % ransac.estimator_.coef_[0])
Slope: 10.735
>>> print('Intercept: %.3f' % ransac.estimator_.intercept_)
Intercept: -44.089


Using RANSAC, we reduced the potential effect of the outliers in this dataset, but 
we don't know if this approach has a positive effect on the predictive performance 
for unseen data. Thus, in the next section we will look at different approaches to 
evaluating a regression model, which is a crucial part of building systems for 
predictive modeling. 

Evaluating the performance of linear regression models


In the previous section, we learned how to fit a regression model on training data.
However, you learned in previous chapters that it is crucial to test the model on data
that it hasn't seen during training to obtain a less biased estimate of its
performance.


As we remember from Chapter 6, Learning Best Practices for Model Evaluation and 
Hyperparameter Tuning, we want to split our dataset into separate training and test 
datasets where we use the former to fit the model and the latter to evaluate its 
performance to generalize to unseen data. Instead of proceeding with the simple 
regression model, we will now use all variables in the dataset and train a multiple 
regression model: 


>>> from sklearn.model_selection import train_test_split
>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> y_train_pred = slr.predict(X_train)
>>> y_test_pred = slr.predict(X_test)


Since our model uses multiple explanatory variables, we can't visualize the linear 
regression line (or hyperplane to be precise) in a two-dimensional plot, but we can 
plot the residuals (the differences or vertical distances between the actual and 
predicted values) versus the predicted values to diagnose our regression model. 
Residual plots are a commonly used graphical tool for diagnosing regression 
models. They can help detect nonlinearity and outliers, and check whether the errors 
are randomly distributed. 


Using the following code, we will now plot a residual plot where we simply subtract 
the true target variables from our predicted responses: 


>>> plt.scatter(y_train_pred, y_train_pred - y_train,
...             c='steelblue', marker='o', edgecolor='white',
...             label='Training data')
>>> plt.scatter(y_test_pred, y_test_pred - y_test,
...             c='limegreen', marker='s', edgecolor='white',
...             label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, color='black', lw=2)
>>> plt.xlim([-10, 50])
>>> plt.show()


After executing the code, we should see a residual plot with a line passing through 
the x-axis origin as shown here: 



In case of a perfect prediction, the residuals would be exactly zero, which we will
probably never encounter in realistic and practical applications. However, for a good
regression model, we would expect that the errors are randomly distributed and the
residuals should be randomly scattered around the centerline. If we see patterns in a
residual plot, it means that our model is unable to capture some explanatory
information, which has leaked into the residuals, as we can slightly see in our
previous residual plot. Furthermore, we can also use residual plots to detect outliers,
which are represented by the points with a large deviation from the centerline.


Another useful quantitative measure of a model's performance is the so-called Mean
Squared Error (MSE), which is simply the averaged value of the SSE cost that we
minimized to fit the linear regression model. The MSE is useful to compare different
regression models or for tuning their parameters via grid search and cross-validation,
as it normalizes the SSE by the sample size:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$
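
To make the formula concrete, the MSE can be computed by hand in NumPy. The four true and predicted values below are made up for illustration.

```python
import numpy as np

# Hypothetical true and predicted target values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_demo = np.array([2.5, 5.5, 7.0, 10.0])

sse = np.sum((y_true_demo - y_hat_demo) ** 2)   # sum of squared errors
mse = sse / y_true_demo.shape[0]                # normalize by the sample size n

print(sse, mse)
```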


Let's compute the MSE of our training and test predictions: 


>>> from sklearn.metrics import mean_squared_error
>>> print('MSE train: %.3f, test: %.3f' % (
...       mean_squared_error(y_train, y_train_pred),
...       mean_squared_error(y_test, y_test_pred)))
MSE train: 19.958, test: 27.196


We see that the MSE on the training set is 19.96, and the MSE of the test set is much
larger, with a value of 27.20, which is an indicator that our model is overfitting the
training data.


Sometimes it may be more useful to report the coefficient of determination ($R^2$),
which can be understood as a standardized version of the MSE, for better
interpretability of the model's performance. Or in other words, $R^2$ is the fraction of
response variance that is captured by the model. The $R^2$ value is defined as:

$$R^2 = 1 - \frac{SSE}{SST}$$

Here, SSE is the sum of squared errors and SST is the total sum of squares:

$$SST = \sum_{i=1}^{n}\left(y^{(i)} - \mu_y\right)^2$$

In other words, SST is simply the variance of the response, scaled by the sample size n.


Let us quickly show that $R^2$ is indeed just a rescaled version of the MSE:

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \mu_y\right)^2} = 1 - \frac{MSE}{Var(y)}$$

For the training dataset, the $R^2$ is bounded between 0 and 1, but it can become
negative for the test set. If $R^2 = 1$, the model fits the data perfectly with a
corresponding MSE = 0.
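
Both forms of the definition, and the possibility of a negative value, can be checked on a small made-up example. The helper name `r2_manual` and the arrays are assumptions for illustration; in practice, scikit-learn's `r2_score` would be used.

```python
import numpy as np

def r2_manual(y, y_pred):
    # R^2 = 1 - SSE/SST, which equals 1 - MSE/Var(y)
    sse = np.sum((y - y_pred) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

y_obs = np.array([3.0, 5.0, 7.0, 9.0])
pred_good = np.array([2.5, 5.5, 7.0, 10.0])   # close to the targets
pred_bad = np.array([9.0, 7.0, 5.0, 3.0])     # worse than predicting the mean

r2_good = r2_manual(y_obs, pred_good)
r2_bad = r2_manual(y_obs, pred_bad)

# Same value via the MSE/variance form
r2_alt = 1.0 - np.mean((y_obs - pred_good) ** 2) / np.var(y_obs)

print(r2_good, r2_bad, r2_alt)
```

The second set of predictions yields a negative value, which is what happens whenever a model does worse than simply predicting the mean of the response.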

Evaluated on the training data, the $R^2$ of our model is 0.765, which doesn't sound
too bad. However, the $R^2$ on the test dataset is only 0.673, which we can compute
by executing the following code:


>>> from sklearn.metrics import r2_score
>>> print('R^2 train: %.3f, test: %.3f' %
...       (r2_score(y_train, y_train_pred),
...        r2_score(y_test, y_test_pred)))
R^2 train: 0.765, test: 0.673
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Using regularized methods for 
regression 


As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using scikit- 
learn, regularization 1s one approach to tackle the problem of overfitting by adding 
additional information, and thereby shrinking the parameter values of the model to 
induce a penalty against complexity. The most popular approaches to regularized 
linear regression are the so-called Ridge Regression, Least Absolute Shrinkage 
and Selection Operator (LASSO), and Elastic Net. 


Ridge regression is an L2 penalized model where we simply add the squared sum of 
the weights to our least-squares cost function: 


$$J(w)_{Ridge} = \sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 + \lambda\|w\|_2^2$$


Here, $\|w\|_2^2 = \sum_{j=1}^{m} w_j^2$. By increasing the value of the 
hyperparameter $\lambda$, we increase the regularization strength and shrink the 
weights of our model. Please note that we don't regularize the intercept term $w_0$. 
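
The shrinking effect can be seen in closed form for a single explanatory variable. The sketch below is an illustration in plain Python, not scikit-learn's implementation; the function name `ridge_1d` and the toy data are hypothetical. Centering the data leaves the intercept unpenalized, and the slope becomes `sxy / (sxx + lam)`:

```python
def ridge_1d(x, y, lam):
    """Ridge fit for one feature; the intercept w0 is not penalized."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Centered cross-product and sum of squares
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    w1 = sxy / (sxx + lam)          # the penalty enlarges the denominator
    w0 = y_mean - w1 * x_mean       # intercept follows from the means
    return w0, w1


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]     # hypothetical data with y = 2x

lambdas = (0.0, 1.0, 10.0, 100.0)
slopes = [ridge_1d(x, y, lam)[1] for lam in lambdas]
for lam, w1 in zip(lambdas, slopes):
    print('lambda=%6.1f  slope=%.4f' % (lam, w1))
```

With `lam=0.0` the ordinary least-squares slope is recovered; larger values pull the slope toward zero without ever making it exactly zero.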


An alternative approach that can lead to sparse models is LASSO. Depending on the 
regularization strength, certain weights can become zero, which also makes LASSO 
useful as a supervised feature selection technique: 


$$J(w)_{LASSO} = \sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 + \lambda\|w\|_1$$


Here, $\|w\|_1 = \sum_{j=1}^{m} |w_j|$ is the L1 penalty. 


However, a limitation of LASSO is that it selects at most n variables if m > n, 
that is, if there are more features than samples. A compromise between Ridge 
regression and LASSO is Elastic Net, which has an L1 penalty to generate sparsity 
and an L2 penalty to overcome some of the limitations of LASSO, such as the number 
of selected variables: 


$$J(w)_{ElasticNet} = \sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 + \lambda_1\sum_{j=1}^{m} w_j^2 + \lambda_2\sum_{j=1}^{m} |w_j|$$


Those regularized regression models are all available via scikit-learn, and the usage 
is similar to the regular regression model except that we have to specify the 
regularization strength via the parameter $\lambda$, for example, optimized via 
k-fold cross-validation. 
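
To illustrate why an L1 penalty can set weights exactly to zero, consider the single-feature case without an intercept, where the cost $\sum_i (y^{(i)} - w x^{(i)})^2 + \lambda |w|$ has a closed-form minimizer based on soft-thresholding. The sketch below is an assumption-laden illustration in plain Python (hypothetical helper names and toy data), not the coordinate-descent solver that scikit-learn uses:

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_1d(x, y, lam):
    """Minimizer of sum((y - w*x)**2) + lam*|w| for one feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.0, -0.5, 0.0, 0.5, 1.0]    # hypothetical data with y = 0.5x

lambdas = (0.0, 4.0, 10.0, 20.0)
weights = [lasso_1d(x, y, lam) for lam in lambdas]
for lam, w in zip(lambdas, weights):
    print('lambda=%5.1f  weight=%.2f' % (lam, w))
```

In contrast to an L2 penalty, which only rescales the weight, the L1 penalty removes the feature entirely once the regularization strength passes a threshold.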

A Ridge regression model can be initialized via: 

>>> from sklearn.linear_model import Ridge
>>> ridge = Ridge(alpha=1.0)

Note that the regularization strength is regulated by the parameter alpha, which is 
similar to the parameter $\lambda$. Likewise, we can initialize a LASSO regressor 
from the linear_model submodule: 

>>> from sklearn.linear_model import Lasso
>>> lasso = Lasso(alpha=1.0)


Lastly, the ElasticNet implementation allows us to vary the L1 to L2 ratio: 


>>> from sklearn.linear_model import ElasticNet
>>> elanet = ElasticNet(alpha=1.0, l1_ratio=0.5)


For example, if we set the l1_ratio to 1.0, the ElasticNet regressor would be equal 
to LASSO regression. For more detailed information about the different 
implementations of linear regression, please see the documentation at 
http://scikit-learn.org/stable/modules/linear_model.html. 


Turning a linear regression model into a curve - polynomial regression 


In the previous sections, we assumed a linear relationship between explanatory and 
response variables. One way to account for a violation of the linearity assumption is 
to use a polynomial regression model by adding polynomial terms: 


$$y = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d$$


Here, d denotes the degree of the polynomial. Although we can use polynomial 
regression to model a nonlinear relationship, it is still considered a multiple linear 
regression model because of the linear regression coefficients w. In the following 
subsections, we will see how we can add such polynomial terms to an existing 
dataset conveniently and fit a polynomial regression model. 


Adding polynomial terms using scikit-learn 


We will now learn how to use the PolynomialFeatures transformer class from 
scikit-learn to add a quadratic term (d = 2) to a simple regression problem with one 
explanatory variable. Then, we compare the polynomial to the linear fit following 
these steps: 


1. Add a second degree polynomial term: 


>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.array([258.0, 270.0, 294.0, 320.0, 342.0,
...               368.0, 396.0, 446.0, 480.0, 586.0])\
...              [:, np.newaxis]
>>> y = np.array([236.4, 234.4, 252.8, 298.6, 314.2,
...               342.2, 360.8, 368.0, 391.2, 390.8])
>>> lr = LinearRegression()
>>> pr = LinearRegression()
>>> quadratic = PolynomialFeatures(degree=2)
>>> X_quad = quadratic.fit_transform(X)


2. Fit a simple linear regression model for comparison: 


>>> lr.fit(X, y)
>>> X_fit = np.arange(250, 600, 10)[:, np.newaxis]
>>> y_lin_fit = lr.predict(X_fit)


3. Fit a multiple regression model on the transformed features for polynomial 
regression: 


>>> pr.fit(X_quad, y)
>>> y_quad_fit = pr.predict(quadratic.fit_transform(X_fit))


4. Plot the results: 


>>> plt.scatter(X, y, label='training points')
>>> plt.plot(X_fit, y_lin_fit,
...          label='linear fit', linestyle='--')
>>> plt.plot(X_fit, y_quad_fit,
...          label='quadratic fit')
>>> plt.legend(loc='upper left')
>>> plt.show()


In the resulting plot, we can see that the polynomial fit captures the relationship 
between the response and explanatory variable much better than the linear fit: 

[Figure: training points with the linear fit (dashed line) and the quadratic fit] 


>>> y_lin_pred = lr.predict(X)
>>> y_quad_pred = pr.predict(X_quad)
>>> print('Training MSE linear: %.3f, quadratic: %.3f' % (
...        mean_squared_error(y, y_lin_pred),
...        mean_squared_error(y, y_quad_pred)))
Training MSE linear: 569.700, quadratic: 61.350
>>> print('Training R^2 linear: %.3f, quadratic: %.3f' % (
...        r2_score(y, y_lin_pred),
...        r2_score(y, y_quad_pred)))
Training R^2 linear: 0.832, quadratic: 0.982


As we can see after executing the code, the MSE decreased from 570 (linear fit) to 
61 (quadratic fit); also, the coefficient of determination reflects a closer fit of the 
quadratic model ($R^2 = 0.982$) as opposed to the linear fit ($R^2 = 0.832$) in this 
particular toy problem. 


Modeling nonlinear relationships in the Housing dataset 


Now that we have learned how to construct polynomial features to fit nonlinear 
relationships in a toy problem, let's take a look at a more concrete example and apply 
those concepts to the data in the Housing dataset. By executing the following code, we 
will model the relationship between house prices and LSTAT (percent lower status of 
the population) using second degree (quadratic) and third degree (cubic) polynomials 
and compare it to a linear fit: 


>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> regr = LinearRegression()

# create quadratic and cubic features
>>> quadratic = PolynomialFeatures(degree=2)
>>> cubic = PolynomialFeatures(degree=3)
>>> X_quad = quadratic.fit_transform(X)
>>> X_cubic = cubic.fit_transform(X)

# fit features
>>> X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]

>>> regr = regr.fit(X, y)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y, regr.predict(X))

>>> regr = regr.fit(X_quad, y)
>>> y_quad_fit = regr.predict(quadratic.fit_transform(X_fit))
>>> quadratic_r2 = r2_score(y, regr.predict(X_quad))

>>> regr = regr.fit(X_cubic, y)
>>> y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
>>> cubic_r2 = r2_score(y, regr.predict(X_cubic))

# plot results
>>> plt.scatter(X, y, label='training points', color='lightgray')
>>> plt.plot(X_fit, y_lin_fit,
...          label='linear (d=1), $R^2=%.2f$' % linear_r2,
...          color='blue',
...          lw=2,
...          linestyle=':')
>>> plt.plot(X_fit, y_quad_fit,
...          label='quadratic (d=2), $R^2=%.2f$' % quadratic_r2,
...          color='red',
...          lw=2,
...          linestyle='-')
>>> plt.plot(X_fit, y_cubic_fit,
...          label='cubic (d=3), $R^2=%.2f$' % cubic_r2,
...          color='green',
...          lw=2,
...          linestyle='--')
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000s [MEDV]')
>>> plt.legend(loc='upper right')
>>> plt.show()


The resulting plot is as follows: 


As we can see, the cubic fit captures the relationship between house prices and 
LSTAT better than the linear and quadratic fit. However, we should be aware that 
adding more and more polynomial features increases the complexity of a model and 
therefore increases the chance of overfitting. Thus, in practice it is always 
recommended to evaluate the performance of the model on a separate test dataset to 
estimate the generalization performance. 
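
To see what the polynomial feature expansion does, and how the number of columns (and therefore weights) grows with the degree, consider the following plain-Python sketch for a single explanatory variable. The helper names and the weights are hypothetical; for one input feature, the column order `[1, x, x**2, ...]` matches what PolynomialFeatures produces with its default settings:

```python
def polynomial_features(x, degree):
    """Expand one scalar feature into [1, x, x**2, ..., x**degree]."""
    return [x ** p for p in range(degree + 1)]


def predict(weights, x):
    """Polynomial model: still a plain weighted sum of the expanded features."""
    features = polynomial_features(x, len(weights) - 1)
    return sum(w * f for w, f in zip(weights, features))


row = polynomial_features(3.0, 2)                       # columns for d = 2
n_columns = [len(polynomial_features(3.0, d)) for d in (1, 2, 3)]

weights = [1.0, -2.0, 0.5]                              # hypothetical w0, w1, w2
y_hat = predict(weights, 3.0)

print(row, n_columns, y_hat)
```

Each additional degree adds another weight that has to be estimated from the same data, which is why higher-degree fits should be checked on held-out samples.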


In addition, polynomial features are not always the best choice for modeling 
nonlinear relationships. For example, with some experience or intuition, just looking 
at the MEDV-LSTAT scatterplot may lead to the hypothesis that a log-transformation 
of the LSTAT feature variable and the square root of MEDV may project the data onto 
a linear feature space suitable for a linear regression fit. For instance, my 
perception is that the relationship between the two variables looks quite similar to 
a decaying exponential function. 


Since the natural logarithm of an exponential function is a straight line, I assume that 
such a log-transformation can be usefully applied here. For example, for 
$f(x) = e^{-x}$: 


$$\log(f(x)) = -x$$
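
The idea that a logarithm turns an exponential curve into a straight line can be verified on synthetic data. The sketch below is a simplified illustration in plain Python, not the LSTAT/MEDV transformation itself: it generates a hypothetical curve $y = 3e^{-0.5x}$, takes the logarithm of the response, and recovers the slope and intercept with an ordinary least-squares line (the helper `fit_line` is an assumption made for this example):

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical exponential decay: y = 3 * exp(-0.5 * x)
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0 * math.exp(-0.5 * xi) for xi in x]

# After the log-transformation, the relationship is exactly linear
log_y = [math.log(yi) for yi in y]
intercept, slope = fit_line(x, log_y)

print('slope: %.3f, intercept: %.3f' % (slope, intercept))
```

The fitted slope equals the decay rate and the intercept equals the logarithm of the scale factor, so a plain linear model suffices once the variable has been transformed.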


Let's test this hypothesis by executing the following code: 


# transform features
>>> X_log = np.log(X)
>>> y_sqrt = np.sqrt(y)

# fit features
>>> X_fit = np.arange(X_log.min()-1,
...                   X_log.max()+1, 1)[:, np.newaxis]
>>> regr = regr.fit(X_log, y_sqrt)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y_sqrt, regr.predict(X_log))

# plot results (raw string so that the LaTeX backslashes are kept as-is)
>>> plt.scatter(X_log, y_sqrt,
...             label='training points',
...             color='lightgray')
>>> plt.plot(X_fit, y_lin_fit,
...          label='linear (d=1), $R^2=%.2f$' % linear_r2,
...          color='blue',
...          lw=2)
>>> plt.xlabel('log(% lower status of the population [LSTAT])')
>>> plt.ylabel(r'$\sqrt{Price \; in \; \$1000s \; [MEDV]}$')
>>> plt.legend(loc='lower left')
>>> plt.show()


After transforming the explanatory variable onto the log space and taking the square 
root of the target variable, we were able to capture the relationship between the two 
variables with a linear regression line that seems to fit the data better 
($R^2 = 0.69$) than any of the previous polynomial feature transformations. 


Dealing with nonlinear relationships using random forests 


In this section, we are going to take a look at random forest regression, which is 
conceptually different from the previous regression models in this chapter. A random 
forest, which is an ensemble of multiple decision trees, can be understood as the 
sum of piecewise linear functions in contrast to the global linear and polynomial 
regression models that we discussed previously. In other words, via the decision tree 
algorithm, we are subdividing the input space into smaller regions that become more 
manageable. 


Decision tree regression 


An advantage of the decision tree algorithm is that it does not require any 
transformation of the features if we are dealing with nonlinear data. We remember 
from Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, that we 
grow a decision tree by iteratively splitting its nodes until the leaves are pure or a 
stopping criterion is satisfied. When we used decision trees for classification, we 
defined entropy as a measure of impurity to determine which feature split maximizes 
the Information Gain (IG), which can be defined as follows for a binary split: 


$$IG(D_p, x_i) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right})$$


Here, $x_i$ is the feature to perform the split, $N_p$ is the number of samples in the 
parent node, I is the impurity function, $D_p$ is the subset of training samples at the 
parent node, and $D_{left}$ and $D_{right}$ are the subsets of training samples at the 
left and right child node after the split. Remember that our goal is to find the 
feature split that maximizes the information gain; in other words, we want to find the 
feature split that reduces the impurities in the child nodes most. In Chapter 3, A Tour 
of Machine Learning Classifiers Using scikit-learn, we discussed Gini impurity and 
entropy as measures of impurity, which are both useful criteria for classification. To 
use a decision tree for regression, however, we need an impurity metric that is 
suitable for continuous variables, so we define the impurity measure of a node t as the 
MSE instead: 


$$I(t) = MSE(t) = \frac{1}{N_t}\sum_{i \in D_t}\left(y^{(i)} - \hat{y}_t\right)^2$$


Here, $N_t$ is the number of training samples at node t, $D_t$ is the training subset 
at node t, $y^{(i)}$ is the true target value, and $\hat{y}_t$ is the predicted target 
value (sample mean): 


$$\hat{y}_t = \frac{1}{N_t}\sum_{i \in D_t} y^{(i)}$$
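
The MSE impurity and the resulting split criterion can be sketched for a single feature as follows. This is a minimal illustration in plain Python with hypothetical helper names and toy data, not the optimized procedure used by scikit-learn: every midpoint between neighboring feature values is tried as a threshold, and the one with the largest impurity reduction wins.

```python
def node_mse(targets):
    """Impurity of a node: MSE around the node's mean prediction."""
    n = len(targets)
    mean = sum(targets) / n
    return sum((t - mean) ** 2 for t in targets) / n


def best_split(x, y):
    """Return (threshold, gain) of the binary split with the largest gain."""
    n = len(y)
    parent_impurity = node_mse(y)
    best_threshold, best_gain = None, 0.0
    values = sorted(set(x))
    for lo, hi in zip(values[:-1], values[1:]):
        threshold = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        # Parent impurity minus the size-weighted child impurities
        gain = (parent_impurity
                - len(left) / n * node_mse(left)
                - len(right) / n * node_mse(right))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical step-shaped data: low targets for small x, high for large x
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]

threshold, gain = best_split(x, y)
print('threshold: %.1f, gain: %.2f' % (threshold, gain))
```

Here the best threshold separates the two groups perfectly, so both child nodes have zero impurity and the gain equals the impurity of the parent node.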


In the context of decision tree regression, the MSE is often also referred to as 
within-node variance, which is why the splitting criterion is also better known as 
variance reduction. To see what the line fit of a decision tree looks like, let us use 
the DecisionTreeRegressor implemented in scikit-learn to model the nonlinear 
relationship between the MEDV and LSTAT variables: 


>>> from sklearn.tree import DecisionTreeRegressor
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> tree = DecisionTreeRegressor(max_depth=3)
>>> tree.fit(X, y)
>>> sort_idx = X.flatten().argsort()
>>> lin_regplot(X[sort_idx], y[sort_idx], tree)
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000s [MEDV]')
>>> plt.show()


As we can see in the resulting plot, the decision tree captures the general trend in the 
data. However, a limitation of this model is that it does not capture the continuity 
and differentiability of the desired prediction. In addition, we need to be careful 
about choosing an appropriate value for the depth of the tree to not overfit or underfit 
the data; here, a depth of three seemed to be a good choice. 


In the next section, we will take a look at a more robust way of fitting regression 
trees: random forests. 


Random forest regression 


As we learned in Chapter 3, A Tour of Machine Learning Classifiers Using 
scikit-learn, the random forest algorithm is an ensemble technique that combines 
multiple decision trees. A random forest usually has a better generalization 
performance than an individual decision tree due to randomness, which helps to 
decrease the model's variance. Other advantages of random forests are that they are 
less sensitive to outliers in the dataset and don't require much parameter tuning. The 
only parameter in random forests that we typically need to experiment with is the 
number of trees in the ensemble. The basic random forest algorithm for regression is 
almost identical to the random forest algorithm for classification that we discussed in 
Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn. The only 
difference is that we use the MSE criterion to grow the individual decision trees, and 
the predicted target variable is calculated as the average prediction over all decision 
trees. 
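
The averaging step can be illustrated with a deliberately small ensemble. The sketch below is a simplification in plain Python, with hypothetical helper names and toy data: each "tree" is only a depth-1 stump fitted on a bootstrap sample, and no random feature subsets are drawn because there is just one feature. It is meant to show the bagging-and-averaging idea, not to reproduce RandomForestRegressor.

```python
import random


def fit_stump(x, y):
    """Depth-1 regression tree: threshold with the lowest summed squared error."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values[:-1], values[1:]):
        thr = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        l_mean = sum(left) / len(left)
        r_mean = sum(right) / len(right)
        sse = (sum((v - l_mean) ** 2 for v in left)
               + sum((v - r_mean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, thr, l_mean, r_mean)
    if best is None:  # bootstrap sample with one unique x: predict the mean
        mean = sum(y) / len(y)
        return (float('inf'), mean, mean)
    return best[1:]


def stump_predict(stump, xi):
    thr, l_mean, r_mean = stump
    return l_mean if xi <= thr else r_mean


def fit_forest(x, y, n_trees, seed):
    rng = random.Random(seed)
    n = len(x)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return forest


def forest_predict(forest, xi):
    """Ensemble prediction: average over the individual tree predictions."""
    preds = [stump_predict(stump, xi) for stump in forest]
    return sum(preds) / len(preds)


x = [float(v) for v in range(10)]
y = [1.0] * 5 + [3.0] * 5          # hypothetical step-shaped targets

forest = fit_forest(x, y, n_trees=25, seed=1)
pred_low = forest_predict(forest, 1.0)
pred_high = forest_predict(forest, 8.0)
print('prediction at x=1: %.2f, at x=8: %.2f' % (pred_low, pred_high))
```

Because every stump sees a slightly different bootstrap sample, the individual thresholds vary, and averaging smooths out these differences.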


Now, let's use all features in the Housing dataset to fit a random forest regression 
model on 60 percent of the samples and evaluate its performance on the remaining 
40 percent. The code is as follows: 


>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test =\
...     train_test_split(X, y,
...                      test_size=0.4,
...                      random_state=1)

>>> from sklearn.ensemble import RandomForestRegressor
>>> # In scikit-learn 1.0 and later, this criterion is named 'squared_error'
>>> forest = RandomForestRegressor(n_estimators=1000,
...                                criterion='mse',
...                                random_state=1,
...                                n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> y_train_pred = forest.predict(X_train)
>>> y_test_pred = forest.predict(X_test)
>>> print('MSE train: %.3f, test: %.3f' % (
...        mean_squared_error(y_train, y_train_pred),
...        mean_squared_error(y_test, y_test_pred)))
MSE train: 1.642, test: 11.052
>>> print('R^2 train: %.3f, test: %.3f' % (
...        r2_score(y_train, y_train_pred),
...        r2_score(y_test, y_test_pred)))
R^2 train: 0.979, test: 0.871


Unfortunately, we see that the random forest tends to overfit the training data. 
However, it's still able to explain the relationship between the target and explanatory 
variables relatively well ($R^2 = 0.871$ on the test dataset). 


Lastly, let us also take a look at the residuals of the prediction: 


>>> plt.scatter(y_train_pred,
...             y_train_pred - y_train,
...             c='steelblue',
...             edgecolor='white',
...             marker='o',
...             s=35,
...             alpha=0.9,
...             label='Training data')
>>> plt.scatter(y_test_pred,
...             y_test_pred - y_test,
...             c='limegreen',
...             edgecolor='white',
...             marker='s',
...             s=35,
...             alpha=0.9,
...             label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='black')
>>> plt.xlim([-10, 50])
>>> plt.show()


As was already summarized by the $R^2$ coefficient, we can see that the model fits 
the training data better than the test data, as indicated by the outliers in the y-axis 
direction. Also, the distribution of the residuals does not seem to be completely 
random around the zero center point, indicating that the model is not able to capture 
all the explanatory information. However, the residual plot indicates a large 
improvement over the residual plot of the linear model that we plotted earlier in this 
chapter. 


Ideally, our model error should be random or unpredictable. In other words, the error 
of the predictions should not be related to any of the information contained in the 
explanatory variables, but should reflect the randomness of the real-world 
distributions or patterns. If we observe patterns in the prediction errors, for example, 
by inspecting the residual plot, it means that the residuals contain predictive 
information. A common reason for this could be that explanatory information is 
leaking into those residuals. 


Unfortunately, there is no universal approach for dealing with non-randomness in 
residual plots, and it requires experimentation. Depending on the data that is 
available to us, we may be able to improve the model by transforming variables, 
tuning the hyperparameters of the learning algorithm, choosing simpler or more 
complex models, removing outliers, or including additional variables. 


Note 

In Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, we also 
learned about the kernel trick, which can be used in combination with a Support 
Vector Machine (SVM) for classification, and is useful if we are dealing with 
nonlinear problems. Although a discussion is beyond the scope of this book, SVMs 
can also be used in nonlinear regression tasks. The interested reader can find more 
information about SVMs for regression in an excellent report: Support Vector 
Machines for Classification and Regression, S. R. Gunn and others, ISIS technical 
report, 14, 1998. An SVM regressor is also implemented in scikit-learn, and more 
information about its usage can be found at 
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR. 


Summary 


At the beginning of this chapter, you learned about simple linear regression analysis 
to model the relationship between a single explanatory variable and a continuous 
response variable. We then discussed a useful exploratory data analysis technique to 
look at patterns and anomalies in data, which is an important first step in predictive 
modeling tasks. 


We built our first model by implementing linear regression using a gradient-based 
optimization approach. We then saw how to utilize scikit-learn's linear models for 
regression and also implement a robust regression technique (RANSAC) as an 
approach for dealing with outliers. To assess the predictive performance of 
regression models, we computed the mean squared error (MSE) and the related $R^2$ 
metric. Furthermore, we also discussed a useful graphical approach to diagnose 
problems of regression models: the residual plot. 


After we discussed how regularization can be applied to regression models to reduce 
the model complexity and avoid overfitting, we also introduced several approaches 
to model nonlinear relationships including polynomial feature transformation and 
random forest regressors. 


We have discussed supervised learning, classification, and regression analysis in 
great detail throughout the previous chapters. In the next chapter, we are going to 
learn about another interesting subfield of machine learning, unsupervised learning, 
and we will also learn how to use cluster analysis for finding hidden structures in 
data in the absence of target variables. 


Chapter 11. Working with Unlabeled Data - Clustering Analysis 


In the previous chapters, we used supervised learning techniques to build machine 
learning models using data where the answer was already known—the class labels 
were already available in our training data. In this chapter, we will switch gears and 
explore cluster analysis, a category of unsupervised learning techniques that allows 
us to discover hidden structures in data where we do not know the right answer 
upfront. The goal of clustering is to find a natural grouping in data so that items in 
the same cluster are more similar to each other than to those from different clusters. 


Given its exploratory nature, clustering is an exciting topic and, in this chapter, we 
will learn about the following concepts, which can help us to organize data into 
meaningful structures: 


• Finding centers of similarity using the popular k-means algorithm 

• Taking a bottom-up approach to building hierarchical clustering trees 

• Identifying arbitrary shapes of objects using a density-based clustering 
approach 


Grouping objects by similarity using k-means 


In this section, we will learn about one of the most popular clustering algorithms, k- 
means, which is widely used in academia as well as in industry. Clustering (or 
cluster analysis) is a technique that allows us to find groups of similar objects, 
objects that are more related to each other than to objects in other groups. Examples 
of business-oriented applications of clustering include the grouping of documents, 
music, and movies by different topics, or finding customers that share similar 
interests based on common purchase behaviors as a basis for recommendation 
engines. 


K-means clustering using scikit-learn 


As we will see in a moment, the k-means algorithm is extremely easy to implement 
but is also computationally very efficient compared to other clustering algorithms, 
which might explain its popularity. The k-means algorithm belongs to the category 
of prototype-based clustering. We will discuss two other categories of clustering, 
hierarchical and density-based clustering, later in this chapter. 


Prototype-based clustering means that each cluster is represented by a prototype, 
which can either be the centroid (average) of similar points with continuous 
features, or the medoid (the most representative or most frequently occurring point) 
in the case of categorical features. While k-means is very good at identifying clusters 
with a spherical shape, one of the drawbacks of this clustering algorithm is that we 
have to specify the number of clusters, k, a priori. An inappropriate choice for k can 
result in poor clustering performance. Later in this chapter, we will discuss the 
elbow method and silhouette plots, which are useful techniques to evaluate the 
quality of a clustering to help us determine the optimal number of clusters k. 


Although k-means clustering can be applied to data in higher dimensions, we will 
walk through the following examples using a simple two-dimensional dataset for the 
purpose of visualization: 


>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=150,
...                   n_features=2,
...                   centers=3,
...                   cluster_std=0.5,
...                   shuffle=True,
...                   random_state=0)

>>> import matplotlib.pyplot as plt
>>> plt.scatter(X[:,0],
...             X[:,1],
...             c='white',
...             marker='o',
...             edgecolor='black',
...             s=50)
>>> plt.grid()
>>> plt.show()


The dataset that we just created consists of 150 randomly generated points that are 
roughly grouped into three regions with higher density, which is visualized via a 
two-dimensional scatterplot. 


In real-world applications of clustering, we do not have any ground truth category 
information (information provided as empirical evidence as opposed to inference) 
about those samples; otherwise, it would fall into the category of supervised 
learning. Thus, our goal is to group the samples based on their feature similarities, 
which can be achieved using the k-means algorithm that can be summarized by the 
following four steps: 


1. Randomly pick k centroids from the sample points as initial cluster centers. 
2. Assign each sample to the nearest centroid $\mu^{(j)}, j \in \{1, \dots, k\}$. 
3. Move the centroids to the center of the samples that were assigned to it. 
4. Repeat steps 2 and 3 until the cluster assignments do not change or a user- 
   defined tolerance or maximum number of iterations is reached. 

Now, the next question is how do we measure similarity between objects? We can 
define similarity as the opposite of distance, and a commonly used distance for 
clustering samples with continuous features is the squared Euclidean distance 
between two points x and y in m-dimensional space: 


$$d(x, y)^2 = \sum_{j=1}^{m}\left(x_j - y_j\right)^2 = \|x - y\|_2^2$$


Note that, in the preceding equation, the index j refers to the jth dimension (feature 
column) of the sample points x and y. In the rest of this section, we will use the 
superscripts i and j to refer to the sample index and cluster index, respectively. 


Based on this Euclidean distance metric, we can describe the k-means algorithm as a 
simple optimization problem, an iterative approach for minimizing the within-cluster 
Sum of Squared Errors (SSE), which is sometimes also called cluster inertia: 


$$SSE = \sum_{i=1}^{n}\sum_{j=1}^{k} w^{(i,j)}\left\|x^{(i)} - \mu^{(j)}\right\|_2^2$$


Here, $\mu^{(j)}$ is the representative point (centroid) for cluster j, and 
$w^{(i,j)} = 1$ if the sample $x^{(i)}$ is in cluster j; $w^{(i,j)} = 0$ otherwise. 
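
As a small numerical illustration of the within-cluster SSE, the following plain-Python sketch compares a sensible and a poor assignment of four hypothetical points to two fixed centroids. Looking up the centroid by each sample's label plays the role of the indicator $w^{(i,j)}$; the function name and the data are assumptions made for this example.

```python
def within_cluster_sse(points, labels, centroids):
    """Sum of squared distances between each sample and its own centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]  # only the assigned cluster contributes
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centroids = [(1.0, 0.0), (11.0, 0.0)]

good_labels = [0, 0, 1, 1]   # each point assigned to its nearby centroid
bad_labels = [0, 1, 0, 1]    # two points assigned to the distant centroid

sse_good = within_cluster_sse(points, good_labels, centroids)
sse_bad = within_cluster_sse(points, bad_labels, centroids)
print('SSE good: %.1f, SSE bad: %.1f' % (sse_good, sse_bad))
```

The assignment step of k-means always picks the nearest centroid for each sample, which is exactly the choice that keeps this quantity as small as possible for the current centroids.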


Now that we have learned how the simple k-means algorithm works, let's apply it to 
our sample dataset using the KMeans class from scikit-learn's cluster module: 


>>> from sklearn.cluster import KMeans
>>> km = KMeans(n_clusters=3,
...             init='random',
...             n_init=10,
...             max_iter=300,
...             tol=1e-04,
...             random_state=0)
>>> y_km = km.fit_predict(X)


Using the preceding code, we set the number of desired clusters to 3; specifying the 
number of clusters a priori is one of the limitations of k-means. We set n_init=10 to 
run the k-means clustering algorithm 10 times independently with different random 
centroids to choose the final model as the one with the lowest SSE. Via the max_iter 
parameter, we specify the maximum number of iterations for each single run (here, 
300). Note that the k-means implementation in scikit-learn stops early if it converges 
before the maximum number of iterations is reached. However, it is possible that 
k-means does not reach convergence for a particular run, which can be problematic 
(computationally expensive) if we choose relatively large values for max_iter. One 
way to deal with convergence problems is to choose larger values for tol, which is a 
parameter that controls the tolerance with regard to the changes in the within-cluster 
sum-squared-error to declare convergence. In the preceding code, we chose a 
tolerance of 1e-04 (=0.0001). 


A problem with k-means is that one or more clusters can be empty. Note that this 
problem does not exist for k-medoids or fuzzy C-means, an algorithm that we will 
discuss later in this section. However, this problem is accounted for in the current 
k-means implementation in scikit-learn. If a cluster is empty, the algorithm will search 
for the sample that is farthest away from the centroid of the empty cluster. Then it 
will reassign the centroid to be this farthest point. 


Note 


When we are applying k-means to real-world data using a Euclidean distance metric, 
we want to make sure that the features are measured on the same scale and apply 
z-score standardization or min-max scaling if necessary. 
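
Both rescaling options can be sketched in a few lines of plain Python. The helper names and the feature values are hypothetical; the standardization below divides by the population standard deviation, which is also the convention used by scikit-learn's StandardScaler.

```python
def standardize(column):
    """z-score standardization: zero mean and unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]


def min_max_scale(column):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical feature measured on a large scale
income = [20000.0, 40000.0, 60000.0]

z = standardize(income)
mm = min_max_scale(income)
print(['%.3f' % v for v in z])
print(mm)
```

Without such rescaling, a feature with a large numeric range would dominate the Euclidean distance and therefore the cluster assignments.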


After we predicted the cluster labels y_km and discussed some of the challenges of 
the k-means algorithm, let's now visualize the clusters that k-means identified in the 
dataset together with the cluster centroids. These are stored under the 
cluster_centers_ attribute of the fitted KMeans object: 


>>> plt.scatter(X[y_km == 0, 0],
...             X[y_km == 0, 1],
...             s=50, c='lightgreen',
...             marker='s', edgecolor='black',
...             label='cluster 1')
>>> plt.scatter(X[y_km == 1, 0],
...             X[y_km == 1, 1],
...             s=50, c='orange',
...             marker='o', edgecolor='black',
...             label='cluster 2')
>>> plt.scatter(X[y_km == 2, 0],
...             X[y_km == 2, 1],
...             s=50, c='lightblue',
...             marker='v', edgecolor='black',
...             label='cluster 3')
>>> plt.scatter(km.cluster_centers_[:, 0],
...             km.cluster_centers_[:, 1],
...             s=250, marker='*',
...             c='red', edgecolor='black',
...             label='centroids')
>>> plt.legend(scatterpoints=1)
>>> plt.grid()
>>> plt.show()


In the following scatterplot, we can see that k-means placed the three centroids at the 
center of each sphere, which looks like a reasonable grouping given this dataset: 

[Figure: scatterplot showing cluster 1, cluster 2, cluster 3, and the centroids] 


Although k-means worked well on this toy dataset, we shall highlight another 
drawback of k-means: we have to specify the number of clusters, k, a priori. The 
number of clusters to choose may not always be so obvious in real-world 
applications, especially if we are working with a higher dimensional dataset that 
cannot be visualized. The other properties of k-means are that clusters do not overlap 
and are not hierarchical, and we also assume that there is at least one item in each 
cluster. Later in this chapter, we will encounter different types of clustering 
algorithms, hierarchical and density-based clustering. Neither type of algorithm 
requires us to specify the number of clusters upfront or assume spherical structures 
in our dataset. 


In the next subsection, we will introduce a popular variant of the classic k-means 
algorithm called k-means++. While it doesn't address those assumptions and 
drawbacks of k-means discussed in the previous paragraph, it can greatly improve 
the clustering results through more clever seeding of the initial cluster centers. 


A smarter way of placing the initial cluster centroids using k-means++ 


So far, we have discussed the classic k-means algorithm that uses a random seed to 
place the initial centroids, which can sometimes result in bad clusterings or slow 
convergence if the initial centroids are chosen poorly. One way to address this issue 
is to run the k-means algorithm multiple times on a dataset and choose the best 
performing model in terms of the SSE. Another strategy is to place the initial 
centroids far away from each other via the k-means++ algorithm, which leads to 
better and more consistent results than the classic k-means (k-means++: The 
Advantages of Careful Seeding, D. Arthur and S. Vassilvitskii in proceedings of the 
eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027-1035. 
Society for Industrial and Applied Mathematics, 2007). The initialization in 
k-means++ can be summarized as follows: 


1. Initialize an empty set M to store the k centroids being selected.
2. Randomly choose the first centroid $\mu^{(j)}$ from the input samples and assign it to M.
3. For each sample $x^{(i)}$ that is not in M, find the minimum squared distance
   $d(x^{(i)}, M)^2$ to any of the centroids in M.
4. To randomly select the next centroid $\mu^{(p)}$, use a weighted probability
   distribution equal to $\frac{d(\mu^{(p)}, M)^2}{\sum_i d(x^{(i)}, M)^2}$.
5. Repeat steps 3 and 4 until k centroids are chosen.
6. Proceed with the classic k-means algorithm.
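The seeding steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name kmeans_pp_init, its seed argument, and the toy array X_demo are assumptions made for this example.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X using k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Step 2: the first centroid is a uniformly random sample.
    centroids = [X[rng.integers(n_samples)]]
    while len(centroids) < k:
        # Step 3: squared distance from every sample to its nearest chosen centroid.
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Step 4: draw the next centroid with probability proportional to d2.
        # Samples that are already centroids have d2 == 0 and cannot be drawn again.
        # (Assumes X holds at least k distinct points, so d2.sum() > 0.)
        next_idx = rng.choice(n_samples, p=d2 / d2.sum())
        centroids.append(X[next_idx])
    return np.asarray(centroids)

# Hypothetical toy data: three tight pairs of points.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
                   [5.1, 5.0], [10.0, 0.0], [10.1, 0.0]])
print(kmeans_pp_init(X_demo, k=3))
```

Because far-away samples receive a larger selection probability, the chosen seeds tend to be spread across the dataset; the returned array could then serve as the starting centroids for the classic k-means iterations.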


To use k-means++ with scikit-learn's KMeans object, we just need to set the init
parameter to 'k-means++'. In fact, 'k-means++' is the default argument to the init
parameter, which is strongly recommended in practice. The only reason why we
haven't used it in the previous example was to not introduce too many concepts all at
once. The rest of this section on k-means will use k-means++, but readers are
encouraged to experiment more with the two different approaches (classic k-means
via init='random' versus k-means++ via init='k-means++') for placing the initial
cluster centroids.

Hard versus soft clustering


Hard clustering describes a family of algorithms where each sample in a dataset is
assigned to exactly one cluster, as in the k-means algorithm that we discussed in the
previous subsection. In contrast, algorithms for soft clustering (sometimes also
called fuzzy clustering) assign a sample to one or more clusters. A popular example
of soft clustering is the fuzzy C-means (FCM) algorithm (also called soft k-means
or fuzzy k-means). The original idea goes back to the 1970s, when Joseph C. Dunn
first proposed an early version of fuzzy clustering to improve k-means (A Fuzzy
Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated
Clusters, J. C. Dunn, 1973). Almost a decade later, James C. Bezdek published his
work on the improvement of the fuzzy clustering algorithm, which is now known as
the FCM algorithm (Pattern Recognition with Fuzzy Objective Function Algorithms,
J. C. Bezdek, Springer Science+Business Media, 2013).


The FCM procedure is very similar to k-means. However, we replace the hard
cluster assignment with probabilities for each point belonging to each cluster. In
k-means, we could express the cluster membership of a sample x with a sparse vector
of binary values:

$$\begin{bmatrix} \mu^{(1)} \rightarrow 0 \\ \mu^{(2)} \rightarrow 1 \\ \mu^{(3)} \rightarrow 0 \end{bmatrix}$$

Here, the index position with value 1 indicates the cluster centroid $\mu^{(j)}$ the sample
is assigned to (assuming $k=3$, $j \in \{1, 2, 3\}$). In contrast, a membership vector in
FCM could be represented as follows:

$$\begin{bmatrix} \mu^{(1)} \rightarrow 0.10 \\ \mu^{(2)} \rightarrow 0.85 \\ \mu^{(3)} \rightarrow 0.05 \end{bmatrix}$$

Here, each value falls in the range [0, 1] and represents a probability of membership
of the respective cluster centroid. The sum of the memberships for a given sample is
equal to 1. Similar to the k-means algorithm, we can summarize the FCM algorithm
in four key steps:


1. Specify the number of k centroids and randomly assign the cluster memberships
   for each point.
2. Compute the cluster centroids $\mu^{(j)}$, $j \in \{1, \dots, k\}$.
3. Update the cluster memberships for each point.
4. Repeat steps 2 and 3 until the membership coefficients do not change, or a
   user-defined tolerance or maximum number of iterations is reached.


The objective function of FCM (we abbreviate it as $J_m$) looks very similar to the
within-cluster sum-squared-error that we minimize in k-means:

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} \left(w^{(i,j)}\right)^{m} \left\| x^{(i)} - \mu^{(j)} \right\|_2^2$$

However, note that the membership indicator $w^{(i,j)}$ is not a binary value as in
k-means ($w^{(i,j)} \in \{0, 1\}$), but a real value that denotes the cluster membership
probability ($w^{(i,j)} \in [0, 1]$). You also may have noticed that we added an additional
exponent to $w^{(i,j)}$; the exponent m, any number greater than or equal to one
(typically m=2; the membership update below is only defined for m > 1), is the
so-called fuzziness coefficient (or simply fuzzifier) that controls the degree of
fuzziness. The larger the value of m, the smaller the cluster membership $w^{(i,j)}$
becomes, which leads to fuzzier clusters. The cluster membership probability itself
is calculated as follows:

$$w^{(i,j)} = \left[ \sum_{p=1}^{k} \left( \frac{\left\| x^{(i)} - \mu^{(j)} \right\|_2}{\left\| x^{(i)} - \mu^{(p)} \right\|_2} \right)^{\frac{2}{m-1}} \right]^{-1}$$





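For example, if we chose three cluster centers as in the previous k-means example,
we could calculate the membership of the $x^{(i)}$ sample belonging to the $\mu^{(j)}$ cluster
as follows:

$$w^{(i,j)} = \left[ \left( \frac{\left\| x^{(i)} - \mu^{(j)} \right\|_2}{\left\| x^{(i)} - \mu^{(1)} \right\|_2} \right)^{\frac{2}{m-1}} + \left( \frac{\left\| x^{(i)} - \mu^{(j)} \right\|_2}{\left\| x^{(i)} - \mu^{(2)} \right\|_2} \right)^{\frac{2}{m-1}} + \left( \frac{\left\| x^{(i)} - \mu^{(j)} \right\|_2}{\left\| x^{(i)} - \mu^{(3)} \right\|_2} \right)^{\frac{2}{m-1}} \right]^{-1}$$

The center $\mu^{(j)}$ of a cluster itself is calculated as the mean of all samples weighted
by the degree to which each sample belongs to that cluster ($\left(w^{(i,j)}\right)^{m}$):

$$\mu^{(j)} = \frac{\sum_{i=1}^{n} \left(w^{(i,j)}\right)^{m} x^{(i)}}{\sum_{i=1}^{n} \left(w^{(i,j)}\right)^{m}}$$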
Just by looking at the equation to calculate the cluster memberships, it is intuitive to
say that each iteration in FCM is more expensive than an iteration in k-means.
However, FCM typically requires fewer iterations overall to reach convergence.
Unfortunately, the FCM algorithm is currently not implemented in scikit-learn.
However, it has been found in practice that both k-means and FCM produce very
similar clustering outputs, as described in a study (Comparative Analysis of k-means
and Fuzzy C-Means Algorithms, S. Ghosh and S. K. Dubey, IJACSA, 4: 35-38,
2013).
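Since scikit-learn offers no FCM estimator, the four FCM steps and the two update equations can be tried out with a short NumPy sketch. The function name fuzzy_c_means, its default arguments, the small distance floor of 1e-12, and the toy data are all assumptions made for illustration; this is not a reference implementation.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Return (memberships W of shape (n, k), centroids of shape (k, d))."""
    rng = np.random.default_rng(seed)
    # Step 1: random memberships; each row is normalized to sum to 1.
    W = rng.random((X.shape[0], k))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: centroids are means of the samples weighted by w^m.
        Wm = W ** m
        centroids = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        # Step 3: membership update from the distance ratios (requires m > 1).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # avoid division by zero
        ratio = dist[:, :, None] / dist[:, None, :]  # ratio[i, j, p] = d_ij / d_ip
        W_new = 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=2)
        # Step 4: stop once the memberships barely change.
        converged = np.max(np.abs(W_new - W)) < tol
        W = W_new
        if converged:
            break
    return W, centroids

# Hypothetical toy data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                    rng.normal(5.0, 0.1, (20, 2))])
W, centroids = fuzzy_c_means(X_demo, k=2)
print(W[:3].round(3))
```

Taking W.argmax(axis=1) turns the soft memberships into hard labels, which makes it easy to compare the outcome with a k-means run on the same data.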
Using the elbow method to find the optimal number of clusters


One of the main challenges in unsupervised learning is that we do not know the
definitive answer. We don't have the ground truth class labels in our dataset that
allow us to apply the techniques that we used in Chapter 6, Learning Best Practices
for Model Evaluation and Hyperparameter Tuning, in order to evaluate the
performance of a supervised model. Thus, to quantify the quality of clustering, we
need to use intrinsic metrics, such as the within-cluster SSE (distortion) that we
discussed earlier in this chapter, to compare the performance of different k-means
clusterings. Conveniently, we don't need to compute the within-cluster SSE
explicitly when we are using scikit-learn, as it is already accessible via the inertia_
attribute after fitting a KMeans model:


>>> print('Distortion: %.2f' % km.inertia_)
Distortion: 72.48
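To make explicit what the distortion measures, the within-cluster SSE can also be computed by hand. The following is a minimal sketch; the helper name within_cluster_sse and the four toy points are assumptions for illustration only.

```python
import numpy as np

def within_cluster_sse(X, labels, centroids):
    """Sum of squared distances between each sample and its assigned centroid."""
    diffs = X - centroids[labels]  # centroid of each sample's own cluster
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two clusters of two points each.
X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels_demo = np.array([0, 0, 1, 1])
centroids_demo = np.array([[1.0, 0.0], [11.0, 0.0]])
print(within_cluster_sse(X_demo, labels_demo, centroids_demo))  # prints 4.0
```

For a fitted KMeans model, passing X, km.labels_, and km.cluster_centers_ to such a helper should reproduce km.inertia_ up to floating-point error.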


Based on the within-cluster SSE, we can use a graphical tool, the so-called elbow
method, to estimate the optimal number of clusters k for a given task. Intuitively, we
can say that, if k increases, the distortion will decrease. This is because the samples
will be closer to the centroids they are assigned to. The idea behind the elbow
method is to identify the value of k beyond which the distortion decreases only
slowly (the bend, or elbow, of the curve), which will become clearer if we plot the
distortion for different values of k:


>>> distortions = []
>>> for i in range(1, 11):
...     km = KMeans(n_clusters=i,
...                 init='k-means++',
...                 n_init=10,
...                 max_iter=300,
...                 random_state=0)
...     km.fit(X)
...     distortions.append(km.inertia_)
>>> plt.plot(range(1,11), distortions, marker='o')
>>> plt.xlabel('Number of clusters')
>>> plt.ylabel('Distortion')
>>> plt.show()


As we can see in the following plot, the elbow is located at k=3, which is evidence
that k=3 is indeed a good choice for this dataset:

Quantifying the quality of clustering via silhouette plots

Another intrinsic metric to evaluate the quality of a clustering is silhouette analysis,
which can also be applied to clustering algorithms other than k-means that we will
discuss later in this chapter. Silhouette analysis can be used as a graphical tool to plot
a measure of how tightly grouped the samples in the clusters are. To calculate the
silhouette coefficient of a single sample in our dataset, we can apply the following
three steps:


1. Calculate the cluster cohesion $a^{(i)}$ as the average distance between a sample
   $x^{(i)}$ and all other points in the same cluster.
2. Calculate the cluster separation $b^{(i)}$ from the next closest cluster as the average
   distance between the sample $x^{(i)}$ and all samples in the nearest cluster.
3. Calculate the silhouette $s^{(i)}$ as the difference between cluster cohesion and
   separation divided by the greater of the two, as shown here:

$$s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max\left\{ b^{(i)}, a^{(i)} \right\}}$$

The silhouette coefficient is bounded in the range -1 to 1. Based on the preceding
equation, we can see that the silhouette coefficient is 0 if the cluster separation and
cohesion are equal ($b^{(i)} = a^{(i)}$). Furthermore, we get close to an ideal silhouette
coefficient of 1 if $b^{(i)} \gg a^{(i)}$, since $b^{(i)}$ quantifies how dissimilar a sample is to
other clusters, and $a^{(i)}$ tells us how similar it is to the other samples in its own
cluster.
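The three steps can be written out directly in NumPy to see how cohesion and separation enter the coefficient. This is a rough sketch under a few assumptions: the function name silhouette_coefficients is made up for this example, there are at least two clusters, samples in singleton clusters are given a coefficient of 0, and no two samples coincide in a way that makes both a and b zero.

```python
import numpy as np

def silhouette_coefficients(X, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every sample."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full matrix of pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    unique = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: leave the coefficient at 0
        # Cohesion a: mean distance to the other members of the same cluster
        # (dist[i, i] is 0, so dividing by n_same - 1 excludes the sample itself).
        a = dist[i, same].sum() / (n_same - 1)
        # Separation b: mean distance to the members of the nearest other cluster.
        b = min(dist[i, labels == c].mean() for c in unique if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Hypothetical toy data: two tight, well-separated clusters on a line.
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_demo = np.array([0, 0, 1, 1])
print(silhouette_coefficients(X_demo, labels_demo))
```

For the first sample, a = 1 and b = 10.5, which gives a coefficient of 9.5 / 10.5, close to the ideal value of 1.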


The silhouette coefficient is available as silhouette_samples from scikit-learn's
metrics module, and optionally, the silhouette_score function can be imported
for convenience. The silhouette_score function calculates the average silhouette
coefficient across all samples, which is equivalent to
numpy.mean(silhouette_samples(...)). By executing the following code, we will
now create a plot of the silhouette coefficients for a k-means clustering with k=3:


>>> km = KMeans(n_clusters=3,
...             init='k-means++',
...             n_init=10,
...             max_iter=300,
...             tol=1e-04,
...             random_state=0)
>>> y_km = km.fit_predict(X)

>>> import numpy as np
>>> from matplotlib import cm
>>> from sklearn.metrics import silhouette_samples
>>> cluster_labels = np.unique(y_km)
>>> n_clusters = cluster_labels.shape[0]
>>> silhouette_vals = silhouette_samples(X,
...                                      y_km,
...                                      metric='euclidean')
>>> y_ax_lower, y_ax_upper = 0, 0
>>> yticks = []
>>> for i, c in enumerate(cluster_labels):
...     c_silhouette_vals = silhouette_vals[y_km == c]
...     c_silhouette_vals.sort()
...     y_ax_upper += len(c_silhouette_vals)
...     color = cm.jet(float(i) / n_clusters)
...     plt.barh(range(y_ax_lower, y_ax_upper),
...              c_silhouette_vals,
...              height=1.0,
...              edgecolor='none',
...              color=color)
...     yticks.append((y_ax_lower + y_ax_upper) / 2.)
...     y_ax_lower += len(c_silhouette_vals)
>>> silhouette_avg = np.mean(silhouette_vals)
>>> plt.axvline(silhouette_avg,
...             color="red",
...             linestyle="--")
>>> plt.yticks(yticks, cluster_labels + 1)
>>> plt.ylabel('Cluster')
>>> plt.xlabel('Silhouette coefficient')
>>> plt.show()


Through a visual inspection of the silhouette plot, we can quickly scrutinize the sizes 
of the different clusters and identify clusters that contain outliers: 


[Figure: silhouette plot for the k-means clustering with k=3; x axis: Silhouette coefficient]





However, as we can see in the preceding silhouette plot, the silhouette coefficients 
are not even close to 0, which is in this case an indicator of a good clustering.
Furthermore, to summarize the goodness of our clustering, we added the average 
silhouette coefficient to the plot (dotted line). 


To see what a silhouette plot looks like for a relatively bad clustering, let's seed the 
k-means algorithm with only two centroids: 


>>> km = KMeans(n_clusters=2,
...             init='k-means++',
...             n_init=10,
...             max_iter=300,
...             tol=1e-04,
...             random_state=0)
>>> y_km = km.fit_predict(X)

>>> plt.scatter(X[y_km==0,0],
...             X[y_km==0,1],
...             s=50, c='lightgreen',
...             edgecolor='black',
...             marker='s',
...             label='cluster 1')
>>> plt.scatter(X[y_km==1,0],
...             X[y_km==1,1],
...             s=50,
...             c='orange',
...             edgecolor='black',
...             marker='o',
...             label='cluster 2')
>>> plt.scatter(km.cluster_centers_[:,0],
...             km.cluster_centers_[:,1],
...             s=250,
...             marker='*',
...             c='red',
...             label='centroids')
>>> plt.legend()
>>> plt.grid()
>>> plt.show()


As we can see in the resulting plot, one of the centroids falls between two of the
three spherical groupings of the sample points. Although the clustering does not look
completely terrible, it is suboptimal:

[Figure: scatterplot of the two-cluster k-means result (cluster 1, cluster 2) with centroids]





Please keep in mind that we typically do not have the luxury of visualizing datasets 
in two-dimensional scatterplots in real-world problems, since we typically work with 
data in higher dimensions. So, next, we create the silhouette plot to evaluate the 
results: 


>>> cluster_labels = np.unique(y_km)
>>> n_clusters = cluster_labels.shape[0]
>>> silhouette_vals = silhouette_samples(X,
...                                      y_km,
...                                      metric='euclidean')
>>> y_ax_lower, y_ax_upper = 0, 0
>>> yticks = []
>>> for i, c in enumerate(cluster_labels):
...     c_silhouette_vals = silhouette_vals[y_km == c]
...     c_silhouette_vals.sort()
...     y_ax_upper += len(c_silhouette_vals)
...     color = cm.jet(i / n_clusters)
...     plt.barh(range(y_ax_lower, y_ax_upper),
...              c_silhouette_vals,
...              height=1.0,
...              edgecolor='none',
...              color=color)
...     yticks.append((y_ax_lower + y_ax_upper) / 2)
...     y_ax_lower += len(c_silhouette_vals)
>>> silhouette_avg = np.mean(silhouette_vals)
>>> plt.axvline(silhouette_avg, color="red", linestyle="--")
>>> plt.yticks(yticks, cluster_labels + 1)
>>> plt.ylabel('Cluster')
>>> plt.xlabel('Silhouette coefficient')
>>> plt.show()


As we can see in the resulting plot, the silhouettes now have visibly different lengths 
and widths, which is evidence for a relatively bad or at least suboptimal clustering: 


[Figure: silhouette plot for the k-means clustering with k=2; x axis: Silhouette coefficient]

Organizing clusters as a hierarchical tree


In this section, we will take a look at an alternative approach to prototype-based 
clustering: hierarchical clustering. One advantage of hierarchical clustering
algorithms is that they allow us to plot dendrograms (visualizations of a binary
hierarchical clustering), which can help with the interpretation of the results by 
creating meaningful taxonomies. Another useful advantage of this hierarchical 
approach is that we do not need to specify the number of clusters up front. 


The two main approaches to hierarchical clustering are agglomerative and divisive 
hierarchical clustering. In divisive hierarchical clustering, we start with one cluster 
that encompasses all our samples, and we iteratively split the cluster into smaller 
clusters until each cluster only contains one sample. In this section, we will focus on 
agglomerative clustering, which takes the opposite approach. We start with each 
sample as an individual cluster and merge the closest pairs of clusters until only one 
cluster remains. 
Grouping clusters in bottom-up fashion


The two standard algorithms for agglomerative hierarchical clustering are single 
linkage and complete linkage. Using single linkage, we compute the distances 
between the most similar members for each pair of clusters and merge the two 
clusters for which the distance between the most similar members is the smallest. 
The complete linkage approach is similar to single linkage but, instead of comparing
the most similar members in each pair of clusters, we compare the most dissimilar
members to perform the merge. This is shown in the following diagram:


[Figure: two pairs of clusters, one annotated "Most similar members (single linkage)" and one annotated "Most dissimilar members (complete linkage)"]





Note 


Other commonly used algorithms for agglomerative hierarchical clustering include 
average linkage and Ward's linkage. In average linkage, we merge the cluster pairs 
based on the minimum average distances between all group members in the two 
clusters. In Ward's linkage, the two clusters that lead to the minimum increase of the 
total within-cluster SSE are merged. 
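The difference between these linkage criteria comes down to how the pairwise distances between two clusters are reduced to a single number. The following sketch illustrates this for single, complete, and average linkage; the function name linkage_distances and the toy clusters A and B are assumptions for illustration.

```python
import numpy as np

def linkage_distances(A, B):
    """Distances between clusters A and B under three linkage criteria."""
    # d[i, j] is the Euclidean distance between A[i] and B[j].
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return {
        'single': float(d.min()),    # most similar (closest) members
        'complete': float(d.max()),  # most dissimilar (farthest) members
        'average': float(d.mean()),  # average over all member pairs
    }

# Hypothetical toy clusters on the x axis.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
print(linkage_distances(A, B))
# prints {'single': 2.0, 'complete': 5.0, 'average': 3.5}
```

In every case, the agglomerative procedure merges the pair of clusters for which the chosen value is smallest.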
In this section, we will focus on agglomerative clustering using the complete linkage
approach. Hierarchical complete linkage clustering is an iterative procedure that can
be summarized by the following steps:


1. Compute the distance matrix of all samples.
2. Represent each data point as a singleton cluster.
3. Merge the two closest clusters based on the distance between the most
   dissimilar (distant) members.
4. Update the distance matrix.
5. Repeat steps 3 and 4 until one single cluster remains.
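These steps can be followed literally in a naive NumPy implementation. The sketch below is meant only to mirror the procedure, not to be efficient: it recomputes the cluster distances from the sample distance matrix in every round instead of updating a cluster-level matrix, and the function name complete_linkage as well as the convention of numbering new clusters n, n+1, ... are assumptions made for this example.

```python
import numpy as np

def complete_linkage(X):
    """Return the merges as tuples (cluster a, cluster b, distance, size)."""
    X = np.asarray(X, dtype=float)
    # Step 1: distance matrix of all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 2: every sample starts as a singleton cluster.
    clusters = {i: [i] for i in range(len(X))}
    merges, next_id = [], len(X)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Complete linkage: distance between the most distant members.
                d = dist[np.ix_(clusters[a], clusters[b])].max()
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Step 3: merge the two closest clusters under that criterion.
        d, a, b = best
        members = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = members
        merges.append((a, b, float(d), len(members)))
        next_id += 1
    return merges

# Hypothetical toy data: three points on a line.
print(complete_linkage(np.array([[0.0], [1.0], [5.0]])))
# prints [(0, 1, 1.0, 2), (2, 3, 5.0, 3)]
```

In the toy run, samples 0 and 1 are merged first into a new cluster with ID 3; sample 2 then joins at distance 5.0, the distance to its farthest member in that cluster.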


Next, we will discuss how to compute the distance matrix (step 1). But first, let's 
generate some random sample data to work with: the rows represent different 
observations (IDs 0-4), and the columns are the different features (x, y, z) of those 
samples: 


>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(123)
>>> variables = ['X', 'Y', 'Z']
>>> labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']
>>> X = np.random.random_sample([5, 3])*10
>>> df = pd.DataFrame(X, columns=variables, index=labels)
>>> df


After executing the preceding code, we should now see the following data frame 
containing the randomly generated samples: 





             X         Y         Z
ID_0  6.964692  2.861393  2.268515
ID_1  5.513148  7.194690  4.231065
ID_2  9.807642  6.848297  4.809319
ID_3  3.921175  3.431780  7.290497
ID_4  4.385722  0.596779  3.980443

Performing hierarchical clustering on a distance matrix


To calculate the distance matrix as input for the hierarchical clustering algorithm, we 
will use the pdist function from SciPy's spatial.distance submodule: 


>>> from scipy.spatial.distance import pdist, squareform 
>>> row_dist = pd.DataFrame(squareform(
...                pdist(df, metric='euclidean')),
...                columns=labels, index=labels)
>>> row_dist


Using the preceding code, we calculated the Euclidean distance between each pair of 
sample points in our dataset based on the features x, y, and z. We provided the 
condensed distance matrix (returned by pdist) as input to the squareform function
to create a symmetrical matrix of the pair-wise distances as shown here: 


[Output: the symmetric 5 x 5 matrix of pairwise Euclidean distances, with rows and columns labeled ID_0 to ID_4 and zeros on the diagonal]
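The relationship between the condensed and the square form can be illustrated without SciPy. In the sketch below, the helper names pairwise_square and to_condensed are hypothetical; the condensed vector lists the upper triangle of the square matrix row by row, which, to the best of my understanding, is also the ordering that pdist uses.

```python
import numpy as np

def pairwise_square(X):
    """Symmetric n x n matrix of pairwise Euclidean distances."""
    X = np.asarray(X, dtype=float)
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def to_condensed(square):
    """Flatten the upper triangle (without the diagonal) row by row."""
    rows, cols = np.triu_indices(square.shape[0], k=1)
    return square[rows, cols]

# Hypothetical toy data: three points on a line through the origin.
X_demo = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
square = pairwise_square(X_demo)
condensed = to_condensed(square)
print(square)
print(condensed)
```

For n samples, the condensed form holds n(n-1)/2 values (here 3), while the square form repeats each of them twice and adds a diagonal of zeros.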





Next, we will apply the complete linkage agglomeration to our clusters using the
linkage function from SciPy's cluster.hierarchy submodule, which returns a
so-called linkage matrix.


However, before we call the linkage function, let us take a careful look at the
function documentation:


>>> from scipy.cluster.hierarchy import linkage 
>>> help (linkage) 


[...]
Parameters:
  y : ndarray
    A condensed or redundant distance matrix. A condensed
    distance matrix is a flat array containing the upper
    triangular of the distance matrix. This is the form
    that pdist returns. Alternatively, a collection of m
    observation vectors in n dimensions may be passed as
    an m by n array.

  method : str, optional
    The linkage algorithm to use. See the Linkage Methods
    section below for full descriptions.

  metric : str, optional
    The distance metric to use. See the distance.pdist
    function for a list of valid distance metrics.

Returns:
  Z : ndarray
    The hierarchical clustering encoded as a linkage matrix.
[...]


Based on the function description, we conclude that we can use a condensed distance
matrix (upper triangular) from the pdist function as an input attribute. Alternatively,
we could also provide the initial data array and use the 'euclidean' metric as a
function argument in linkage. However, we should not use the squareform distance
matrix that we defined earlier, since linkage would treat its rows as observation
vectors and therefore yield different distance values than expected. To sum it up,
the three possible scenarios are listed here:


• Incorrect approach: Using the squareform distance matrix shown in the
  following code snippet would lead to incorrect results:

>>> from scipy.cluster.hierarchy import linkage
>>> row_clusters = linkage(row_dist,
...                        method='complete',
...                        metric='euclidean')

• Correct approach: Using the condensed distance matrix as shown in the
  following code example yields the correct linkage matrix:

>>> row_clusters = linkage(pdist(df, metric='euclidean'),
...                        method='complete')

• Correct approach: Using the complete input sample matrix as shown in the
  following code snippet also leads to a correct linkage matrix, identical to the
  one from the preceding approach:

>>> row_clusters = linkage(df.values,
...                        method='complete',
...                        metric='euclidean')

To take a closer look at the clustering results, we can turn clustering results into a 
pandas DataFrame (best viewed in a Jupyter Notebook) as follows: 


>>> pd.DataFrame(row_clusters,
...              columns=['row label 1',
...                       'row label 2',
...                       'distance',
...                       'no. of items in clust.'],
...              index=['cluster %d' %(i+1) for i in
...                     range(row_clusters.shape[0])])


As shown in the following screenshot, the linkage matrix consists of several rows
where each row represents one merge. The first and second columns give the indices
of the two clusters merged in that step (indices 0 to 4 are the original samples;
higher indices refer to clusters formed in earlier rows), and the third column reports
the distance between the most dissimilar members of those two clusters. The last
column returns the count of the members in each newly formed cluster:


           row label 1  row label 2  distance  no. of items in clust.
cluster 1          0.0          4.0  3.835396                     2.0
cluster 2          1.0          2.0  4.347073                     2.0
cluster 3          3.0          5.0  5.899885                     3.0
cluster 4          6.0          7.0  8.316594                     5.0
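To read such a linkage matrix, it helps to expand every merge into the original sample IDs. The sketch below does this in plain Python, assuming SciPy's convention that the cluster created in row i of a matrix for n samples receives the index n + i. The function name expand_linkage is hypothetical, and Z restates the four merge rows of the five-sample example (row 3 joins sample 3 with cluster 5, and row 4 joins clusters 6 and 7).

```python
def expand_linkage(Z, n_samples):
    """Map each merge row to (distance, sorted list of original sample indices)."""
    members = {i: [i] for i in range(n_samples)}
    merged = []
    for step, (left, right, distance, count) in enumerate(Z):
        new_members = sorted(members[int(left)] + members[int(right)])
        if len(new_members) != int(count):
            raise ValueError('row %d: member count does not match' % step)
        # The cluster created in this row gets the next free index.
        members[n_samples + step] = new_members
        merged.append((distance, new_members))
    return merged

Z = [[0.0, 4.0, 3.835396, 2.0],
     [1.0, 2.0, 4.347073, 2.0],
     [3.0, 5.0, 5.899885, 3.0],
     [6.0, 7.0, 8.316594, 5.0]]
for distance, sample_ids in expand_linkage(Z, n_samples=5):
    print(round(distance, 2), sample_ids)
```

Read this way, the third merge adds sample 3 to the cluster formed from samples 0 and 4, and the final merge joins that cluster with the pair of samples 1 and 2.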





Now that we have computed the linkage matrix, we can visualize the results in the 
form of a dendrogram: 


>>> from scipy.cluster.hierarchy import dendrogram
>>> # make dendrogram black (part 1/2)
>>> # from scipy.cluster.hierarchy import set_link_color_palette
>>> # set_link_color_palette(['black'])
>>> row_dendr = dendrogram(row_clusters,
...                        labels=labels,
...                        # make dendrogram black (part 2/2)
...                        # color_threshold=np.inf
...                        )
>>> plt.tight_layout()
>>> plt.ylabel('Euclidean distance')
>>> plt.show()


If you are executing the preceding code or reading an ebook version of this book, 
you will notice that the branches in the resulting dendrogram are shown in different 
colors. The coloring scheme is derived from a list of Matplotlib colors that are 
cycled for the distance thresholds in the dendrogram. For example, to display the 
dendrograms in black, you can uncomment the respective sections that I inserted in 
the preceding code: 


[Figure: dendrogram of the five samples; y axis: Euclidean distance]





Such a dendrogram summarizes the different clusters that were formed during the
agglomerative hierarchical clustering; for example, we can see that the samples ID_0
and ID_4, followed by ID_1 and ID_2, are the most similar ones based on the
Euclidean distance metric.

Attaching dendrograms to a heat map


In practical applications, hierarchical clustering dendrograms are often used in 

combination with a heat map, which allows us to represent the individual values in 
the sample matrix with a color code. In this section, we will discuss how to attach a 
dendrogram to a heat map plot and order the rows in the heat map correspondingly. 


However, attaching a dendrogram to a heat map can be a little bit tricky, so let's go 
through this procedure step by step: 


1. We create a new figure object and define the x axis position, y axis position,
   width, and height of the dendrogram via the add_axes attribute. Furthermore,
   we rotate the dendrogram 90 degrees counter-clockwise. The code is as follows:


>>> fig = plt.figure(figsize=(8,8), facecolor='white')
>>> axd = fig.add_axes([0.09,0.1,0.2,0.6])
>>> row_dendr = dendrogram(row_clusters, orientation='left')
>>> # note: for matplotlib < v1.5.1, please use orientation='right'


2. Next, we reorder the data in our initial DataFrame according to the clustering
   labels that can be accessed from the dendrogram object, which is essentially a
   Python dictionary, via the leaves key. The code is as follows:


>>> df_rowclust = df.iloc[row_dendr['leaves'][::-1]]


3. Now, we construct the heat map from the reordered DataFrame and position it
   next to the dendrogram:


>>> axm = fig.add_axes([0.23,0.1,0.6,0.6])
>>> cax = axm.matshow(df_rowclust,
...                   interpolation='nearest', cmap='hot_r')


4. Finally, we will modify the aesthetics of the dendrogram by removing the axis 
ticks and hiding the axis spines. Also, we will add a color bar and assign the 
feature and sample names to the x and y axis tick labels, respectively: 


>>> axd.set_xticks([])
>>> axd.set_yticks([])
>>> for i in axd.spines.values():
...     i.set_visible(False)
>>> fig.colorbar(cax)
>>> axm.set_xticklabels([''] + list(df_rowclust.columns))
>>> axm.set_yticklabels([''] + list(df_rowclust.index))
>>> plt.show()
After following the previous steps, the heat map should be displayed with the
dendrogram attached:

As we can see, the order of rows in the heat map reflects the clustering of the
samples in the dendrogram. In addition to a simple dendrogram, the color-coded
values of each sample and feature in the heat map provide us with a nice summary of
the dataset.

Applying agglomerative clustering via scikit-learn


In the previous subsection, we saw how to perform agglomerative hierarchical
clustering using SciPy. However, there is also an AgglomerativeClustering
implementation in scikit-learn, which allows us to choose the number of clusters that
we want to return. This is useful if we want to prune the hierarchical cluster tree. By
setting the n_clusters parameter to 3, we will now cluster the samples into three
groups using the same complete linkage approach based on the Euclidean distance
metric, as before:


>>> from sklearn.cluster import AgglomerativeClustering 
>>> # note: scikit-learn 1.2 and later rename the affinity parameter to metric
>>> ac = AgglomerativeClustering(n_clusters=3,
...                              affinity='euclidean',
...                              linkage='complete')
>>> labels = ac.fit_predict(X)
>>> print('Cluster labels: %s' % labels)
Cluster labels: [1 0 0 2 1]


Looking at the predicted cluster labels, we can see that the first and the fifth sample
(ID_0 and ID_4) were assigned to one cluster (label 1), and the samples ID_1 and
ID_2 were assigned to a second cluster (label 0). The sample ID_3 was put into its
own cluster (label 2). Overall, the results are consistent with the results that we
observed in the dendrogram. We shall note though that ID_3 is more similar to ID_4
and ID_0 than to ID_1 and ID_2, as shown in the preceding dendrogram figure; this is
not clear from scikit-learn's clustering results. Let's now rerun the
AgglomerativeClustering using n_clusters=2 in the following code snippet:


>>> ac = AgglomerativeClustering(n_clusters=2,
...                              affinity='euclidean',
...                              linkage='complete')
>>> labels = ac.fit_predict(X)
>>> print('Cluster labels: %s' % labels)
Cluster labels: [0 1 1 0 0]


As we can see, in this pruned clustering hierarchy, sample ID_3 was assigned to the
same cluster as ID_0 and ID_4, as expected.

Locating regions of high density via DBSCAN


Although we can't cover the vast amount of different clustering algorithms in this 
chapter, let's at least introduce one more approach to clustering: Density-based 
Spatial Clustering of Applications with Noise (DBSCAN), which does not make 
assumptions about spherical clusters like k-means, nor does it partition the dataset 
into hierarchies that require a manual cut-off point. As its name implies, density- 
based clustering assigns cluster labels based on dense regions of points. In 
DBSCAN, the notion of density is defined as the number of points within a specified
radius ε.


According to the DBSCAN algorithm, a special label is assigned to each sample
(point) using the following criteria:


• A point is considered a core point if at least a specified number (MinPts) of
  neighboring points fall within the specified radius ε
• A border point is a point that has fewer neighbors than MinPts within ε, but
  lies within the ε radius of a core point
• All other points that are neither core nor border points are considered noise
  points


After labeling the points as core, border, or noise, the DBSCAN algorithm can be
summarized in two simple steps:


1. Form a separate cluster for each core point or connected group of core points
   (core points are connected if they are no farther away than ε).
2. Assign each border point to the cluster of its corresponding core point.
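The labeling criteria can be checked on a small example with NumPy. The sketch below covers only the labeling of points as core, border, or noise, not the subsequent cluster formation. The function name classify_points is hypothetical, and the sketch assumes that a point counts as its own neighbor, which is also how scikit-learn documents its min_samples parameter.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label every sample as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within = dist <= eps
    # Number of points in each eps-neighborhood (the point itself included).
    core = within.sum(axis=1) >= min_pts
    # Border points are not core but lie within eps of at least one core point.
    border = ~core & (within & core[None, :]).any(axis=1)
    kinds = np.full(len(X), 'noise', dtype=object)
    kinds[border] = 'border'
    kinds[core] = 'core'
    return kinds

# Hypothetical toy data on a line: a dense group, one straggler, one outlier.
X_demo = np.array([[0.0], [0.5], [1.0], [1.8], [10.0]])
print(classify_points(X_demo, eps=1.1, min_pts=3))
# prints ['core' 'core' 'core' 'border' 'noise']
```

The point at 1.8 has only two points in its neighborhood (itself and the point at 1.0), so it is not a core point, but it lies within ε of a core point and is therefore a border point; the point at 10.0 is noise.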


To get a better understanding of what the result of DBSCAN can look like before
jumping to the implementation, let's summarize what we have just learned about core
points, border points, and noise points in the following figure:

[Figure: core, border, and noise points for MinPts = 3]


One of the main advantages of using DBSCAN is that it does not assume that the 
clusters have a spherical shape as in k-means. Furthermore, DBSCAN is different 
from k-means and hierarchical clustering in that it doesn't necessarily assign each 
point to a cluster but is capable of removing noise points.


For a more illustrative example, let's create a new dataset of half-moon-shaped 
structures to compare k-means clustering, hierarchical clustering, and DBSCAN: 


>>> from sklearn.datasets import make_moons
>>> X, y = make_moons(n_samples=200,
...                   noise=0.05,
...                   random_state=0)
>>> plt.scatter(X[:,0], X[:,1])
>>> plt.show()

As we can see in the resulting plot, there are two visible, half-moon-shaped groups
consisting of 100 sample points each:





We will start by using the k-means algorithm and complete linkage clustering to see 
if one of those previously discussed clustering algorithms can successfully identify 
the half-moon shapes as separate clusters. The code is as follows:


>>> f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,3))
>>> km = KMeans(n_clusters=2,
...             random_state=0)
>>> y_km = km.fit_predict(X)
>>> ax1.scatter(X[y_km==0,0],
...             X[y_km==0,1],
...             c='lightblue',
...             edgecolor='black',
...             marker='o',
...             s=40,
...             label='cluster 1')
>>> ax1.scatter(X[y_km==1,0],
...             X[y_km==1,1],
...             c='red',
...             edgecolor='black',
...             marker='s',
...             s=40,
...             label='cluster 2')
>>> ax1.set_title('K-means clustering')
>>> ac = AgglomerativeClustering(n_clusters=2,
...                              affinity='euclidean',
...                              linkage='complete')
>>> y_ac = ac.fit_predict(X)
>>> ax2.scatter(X[y_ac==0,0],
...             X[y_ac==0,1],
...             c='lightblue',
...             edgecolor='black',
...             marker='o',
...             s=40,
...             label='cluster 1')
>>> ax2.scatter(X[y_ac==1,0],
...             X[y_ac==1,1],
...             c='red',
...             edgecolor='black',
...             marker='s',
...             s=40,
...             label='cluster 2')
>>> ax2.set_title('Agglomerative clustering')
>>> plt.legend()
>>> plt.show()


Based on the visualized clustering results, we can see that the k-means algorithm is
unable to separate the two clusters, and the hierarchical clustering algorithm was
also challenged by those complex shapes:


[Figure: side-by-side scatterplots titled "K-means clustering" and "Agglomerative clustering"; legend: cluster 1, cluster 2]







Finally, let us try the DBSCAN algorithm on this dataset to see if it can find the two
half-moon-shaped clusters using a density-based approach:


>>> from sklearn.cluster import DBSCAN
>>> db = DBSCAN(eps=0.2,
...             min_samples=5,
...             metric='euclidean')
>>> y_db = db.fit_predict(X)
>>> plt.scatter(X[y_db == 0, 0],
...             X[y_db == 0, 1],
...             c='lightblue',
...             edgecolor='black',
...             marker='o',
...             s=40,
...             label='cluster 1')
>>> plt.scatter(X[y_db == 1, 0],
...             X[y_db == 1, 1],
...             c='red',
...             edgecolor='black',
...             marker='s',
...             s=40,
...             label='cluster 2')
>>> plt.legend()
>>> plt.show()


The DBSCAN algorithm can successfully detect the half-moon shapes, which
highlights one of the strengths of DBSCAN, clustering data of arbitrary shapes:


[Figure: DBSCAN result with the two half-moons labeled cluster 1 and cluster 2]





However, we shall also note some of the disadvantages of DBSCAN. With an
increasing number of features in our dataset (assuming a fixed number of training
examples), the negative effect of the curse of dimensionality increases. This is
especially a problem if we are using the Euclidean distance metric. However, the
problem of the curse of dimensionality is not unique to DBSCAN; it also affects
other clustering algorithms that use the Euclidean distance metric, for example,
k-means and hierarchical clustering algorithms. In addition, we have two
hyperparameters in DBSCAN (MinPts and ε) that need to be optimized to yield
good clustering results. Finding a good combination of MinPts and ε can be
problematic if the density differences in the dataset are relatively large.


Note 


So far, we have seen three of the most fundamental categories of clustering
algorithms: prototype-based clustering with k-means, agglomerative hierarchical
clustering, and density-based clustering via DBSCAN. However, I also want to
mention a fourth class of more advanced clustering algorithms that we have not
covered in this chapter: graph-based clustering. Probably the most prominent
members of the graph-based clustering family are the spectral clustering
algorithms. Although there are many different implementations of spectral
clustering, they all have in common that they use the eigenvectors of a similarity or
distance matrix to derive the cluster relationships. Since spectral clustering is beyond
the scope of this book, you can read the excellent tutorial by Ulrike von Luxburg to
learn more about this topic (A tutorial on spectral clustering, U. von Luxburg,
Statistics and Computing, 17(4): 395-416, 2007). It is freely available from arXiv at
http://arxiv.org/pdf/0711.0189v1.pdf.


Note that, in practice, it is not always obvious which clustering algorithm will 
perform best on a given dataset, especially if the data comes in multiple dimensions 
that make it hard or impossible to visualize. Furthermore, it is important to
emphasize that a successful clustering not only depends on the algorithm and its 
hyperparameters. Rather, the choice of an appropriate distance metric and the use of 
domain knowledge that can help guide the experimental setup can be even more 
important. 


In the context of the curse of dimensionality, it is thus common practice to apply
dimensionality reduction techniques prior to performing clustering. Such
dimensionality reduction techniques for unsupervised datasets include principal
component analysis and RBF kernel principal component analysis, which we
covered in Chapter 5, Compressing Data via Dimensionality Reduction. Also, it is
particularly common to compress datasets down to two-dimensional subspaces,
which allows us to visualize the clusters and assigned labels in two-dimensional
scatterplots that are particularly helpful for evaluating the results.
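

As a minimal sketch of this workflow, the following NumPy snippet projects a dataset
onto its first two principal components before any clustering is run. The helper name
pca_2d and the randomly generated data are assumptions made purely for illustration;
in practice, the PCA class from scikit-learn covered in Chapter 5 would likely be the
more convenient choice:

```python
import numpy as np

def pca_2d(X):
    """Project X (n_samples x n_features) onto its first two principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, sorted by decreasing explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

rng = np.random.RandomState(0)
X_high = rng.normal(size=(100, 10))   # hypothetical 10-dimensional dataset
X_2d = pca_2d(X_high)
print(X_2d.shape)                     # (100, 2)
```

The two columns of X_2d could then be passed to a clustering algorithm and plotted in
a scatterplot with the assigned cluster labels as colors.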
Summary


In this chapter, you learned about three different clustering algorithms that can help
us with the discovery of hidden structures or information in data. We started this
chapter with a prototype-based approach, k-means, which clusters samples into
spherical shapes based on a specified number of cluster centroids. Since clustering is
an unsupervised method, we do not enjoy the luxury of ground truth labels to
evaluate the performance of a model. Thus, we used intrinsic performance metrics
such as the elbow method or silhouette analysis as an attempt to quantify the quality
of clustering.


We then looked at a different approach to clustering: agglomerative hierarchical 
clustering. Hierarchical clustering does not require specifying the number of clusters 
up front, and the result can be visualized in a dendrogram representation, which can 
help with the interpretation of the results. The last clustering algorithm that we saw 
in this chapter was DBSCAN, an algorithm that groups points based on local 
densities and is capable of handling outliers and identifying non-globular shapes.


After this excursion into the field of unsupervised learning, it is now about time to
introduce some of the most exciting machine learning algorithms for supervised 
learning: multilayer artificial neural networks. After their recent resurgence, neural 
networks are once again the hottest topic in machine learning research. Thanks to 
recently developed deep learning algorithms, neural networks are considered state- 
of-the-art for many complex tasks such as image classification and speech 
recognition. In Chapter 12, Implementing a Multilayer Artificial Neural Network 
from Scratch, we will construct our own multilayer neural network from scratch. In 
Chapter 13, Parallelizing Neural Network Training with TensorFlow, we will 
introduce powerful libraries that can help us to train complex network architectures 
most efficiently. 
Chapter 12. Implementing a Multilayer Artificial Neural Network from Scratch


As you may know, deep learning is getting a lot of attention from the press and is
without any doubt the hottest topic in the machine learning field. Deep learning can
be understood as a set of algorithms that were developed to train artificial neural
networks with many layers most efficiently. In this chapter, you will learn the basic
concepts of artificial neural networks so that you will be well-equipped for the
following chapters, which will introduce advanced Python-based deep learning
libraries and Deep Neural Network (DNN) architectures that are particularly
well-suited for image and text analyses.


The topics that we will cover in this chapter are as follows: 


• Getting a conceptual understanding of multilayer neural networks

• Implementing the fundamental backpropagation algorithm for neural network
  training from scratch

• Training a basic multilayer neural network for image classification
Modeling complex functions with artificial neural networks


At the beginning of this book, we started our journey through machine learning
algorithms with artificial neurons in Chapter 2, Training Simple Machine Learning
Algorithms for Classification. Artificial neurons represent the building blocks of the
multilayer artificial neural networks that we will discuss in this chapter. The basic
concept behind artificial neural networks was built upon hypotheses and models of
how the human brain works to solve complex problem tasks. Although artificial
neural networks have gained a lot of popularity in recent years, early studies of
neural networks go back to the 1940s when Warren McCulloch and Walter Pitts first
described how neurons could work.


However, in the decades that followed the first implementation of the
McCulloch-Pitts neuron model, Rosenblatt's perceptron in the 1950s, many
researchers and machine learning practitioners slowly began to lose interest in
neural networks since no one had a good solution for training a neural network with
multiple layers. Eventually, interest in neural networks was rekindled in 1986 when
D.E. Rumelhart, G.E. Hinton, and R.J. Williams were involved in the (re)discovery
and popularization of the backpropagation algorithm to train neural networks more
efficiently, which we will discuss in more detail later in this chapter (Learning
representations by back-propagating errors, David E. Rumelhart, Geoffrey E.
Hinton, Ronald J. Williams, Nature, 323 (6088): 533-536, 1986). Readers who are
interested in the history of Artificial Intelligence (AI), machine learning, and neural
networks are also encouraged to read the Wikipedia article on AI winter, which
describes the periods of time in which a large portion of the research community
lost interest in the study of neural networks (https://en.wikipedia.org/wiki/AI_winter).


However, neural networks have never been as popular as they are today, thanks to
the many major breakthroughs that have been made in the previous decade, which
resulted in what we now call deep learning algorithms and architectures, that is,
neural networks that are composed of many layers. Neural networks are a hot topic
not only in academic research but also in big technology companies such as
Facebook, Microsoft, and Google, who invest heavily in artificial neural networks
and deep learning research. As of today, complex neural networks powered by deep
learning algorithms are considered state of the art when it comes to complex problem
solving such as image and voice recognition. Popular examples of the products in our
everyday life that are powered by deep learning are Google's image search and
Google Translate, an application for smartphones that can automatically recognize
text in images for real-time translation into more than 20 languages.


Many exciting applications of DNNs have been developed at major tech companies
and in the pharmaceutical industry, as listed in the following, non-comprehensive
list of examples:


• Facebook's DeepFace for tagging images (DeepFace: Closing the Gap to
  Human-Level Performance in Face Verification, Y. Taigman, M. Yang, M.
  Ranzato, and L. Wolf, IEEE Conference on Computer Vision and Pattern
  Recognition (CVPR), pages 1701-1708, 2014)

• Baidu's DeepSpeech, which is able to handle voice queries in Mandarin
  (DeepSpeech: Scaling up end-to-end speech recognition, A. Hannun, C. Case,
  J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S.
  Sengupta, A. Coates, and Andrew Y. Ng, arXiv preprint arXiv:1412.5567, 2014)

• Google's new language translation service (Google's Neural Machine
  Translation System: Bridging the Gap between Human and Machine
  Translation, arXiv preprint arXiv:1609.08144, 2016)

• Novel techniques for drug discovery and toxicity prediction (Toxicity prediction
  using Deep Learning, T. Unterthiner, A. Mayr, G. Klambauer, and S.
  Hochreiter, arXiv preprint arXiv:1503.01445, 2015)

• A mobile application that can detect skin cancer with an accuracy similar to
  professionally trained dermatologists (Dermatologist-level classification of skin
  cancer with deep neural networks, A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S.
  M. Swetter, H. M. Blau, and S. Thrun, in Nature 542, no. 7639, 2017, pages
  115-118)
Single-layer neural network recap


This chapter is all about multilayer neural networks, how they work, and how to train
them to solve complex problems. However, before we dig deeper into a particular
multilayer neural network architecture, let's briefly reiterate some of the concepts of
single-layer neural networks that we introduced in Chapter 2, Training Simple
Machine Learning Algorithms for Classification, namely, the ADAptive LInear
NEuron (Adaline) algorithm, which is shown in the following figure:


In Chapter 2, Training Simple Machine Learning Algorithms for Classification, we
implemented the Adaline algorithm to perform binary classification, and we used the
gradient descent optimization algorithm to learn the weight coefficients of the model.
In every epoch (pass over the training set), we updated the weight vector w using the
following update rule:


$$w := w + \Delta w, \quad \text{where } \Delta w = -\eta \nabla J(w)$$


In other words, we computed the gradient based on the whole training set and
updated the weights of the model by taking a step in the opposite direction of the
gradient $\nabla J(w)$. In order to find the optimal weights of the model, we
optimized an objective function that we defined as the Sum of Squared Errors (SSE)
cost function $J(w)$. Furthermore, we multiplied the gradient by a factor, the learning
rate $\eta$, which we had to choose carefully to balance the speed of learning against
the risk of overshooting the global minimum of the cost function.


In gradient descent optimization, we updated all weights simultaneously after each
epoch, and we defined the partial derivative for each weight $w_j$ in the weight vector
$w$ as follows:


$$\frac{\partial}{\partial w_j} J(w) = -\sum_i \left( y^{(i)} - a^{(i)} \right) x_j^{(i)}$$


Here, $y^{(i)}$ is the target class label of a particular sample $x^{(i)}$, and $a^{(i)}$ is the
activation of the neuron, which is a linear function in the special case of Adaline.


Furthermore, we defined the activation function $\phi(\cdot)$ as follows:


$$\phi(z) = z = a$$


Here, the net input $z$ is a linear combination of the weights that are connecting the
input to the output layer:


$$z = \sum_j w_j x_j = w^T x$$


While we used the activation $\phi(z)$ to compute the gradient update, we implemented
a threshold function to squash the continuous valued output into binary class labels
for prediction:


$$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
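

To make the recap concrete, the update rule, the partial derivative, and the threshold
function can be sketched in a few lines of NumPy. This is only an illustrative sketch
and not the Adaline class from Chapter 2: the function names adaline_epoch and
predict, the toy data, and the learning rate are assumptions, the bias is stored in
w[0], and the cost is taken to be the SSE with a factor of 1/2 so that its gradient
matches the partial derivative shown above:

```python
import numpy as np

def adaline_epoch(X, y, w, eta=0.01):
    """One batch gradient descent step for Adaline (w[0] holds the bias weight)."""
    z = X @ w[1:] + w[0]              # net input z = w^T x
    a = z                             # linear activation: phi(z) = z
    errors = y - a
    cost = 0.5 * (errors ** 2).sum()  # SSE cost before the update
    grad_w = -X.T @ errors            # dJ/dw_j = -sum_i (y_i - a_i) * x_ij
    grad_b = -errors.sum()
    w = w.copy()
    w[1:] -= eta * grad_w             # w := w + delta_w, delta_w = -eta * grad J(w)
    w[0] -= eta * grad_b
    return w, cost

def predict(X, w):
    """Threshold the continuous activation into the class labels 1 and -1."""
    return np.where(X @ w[1:] + w[0] >= 0.0, 1, -1)

# Hypothetical, linearly separable toy data with one feature
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1, -1, 1, 1])
w = np.zeros(2)
costs = []
for _ in range(20):
    w, cost = adaline_epoch(X, y, w, eta=0.05)
    costs.append(cost)
print(predict(X, w))
```

Note that all weights are updated once per pass over the whole training set, which is
what distinguishes this batch variant from the stochastic variant.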


Note 


Note that although Adaline consists of two layers, one input layer and one output 
layer, it is called a single-layer network because of its single link between the input
and output layers. 


Also, we learned about a certain trick to accelerate the model learning, the so-called 
stochastic gradient descent optimization. Stochastic gradient descent approximates 
the cost from a single training sample (online learning) or a small subset of training 
samples (mini-batch learning). We will make use of this concept later in this chapter 
when we implement and train a multilayer perceptron. Apart from faster learning— 
due to the more frequent weight updates compared to gradient descent—its noisy 
nature is also regarded as beneficial when training multilayer neural networks with 
non-linear activation functions, which do not have a convex cost function. Here, the 
added noise can help to escape local cost minima, but we will discuss this topic in 
more detail later in this chapter. 
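

The difference between online, mini-batch, and full-batch learning comes down to
how many samples contribute to each weight update. The following sketch shows one
possible way to iterate over shuffled mini-batches; the generator name minibatches
and the tiny arrays are hypothetical and only serve to show the mechanics:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the training set once."""
    indices = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.RandomState(1)
X = np.arange(20, dtype=float).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)

# With batch_size=4, one epoch gives three weight updates instead of one
batch_sizes = [len(y_b) for _, y_b in minibatches(X, y, batch_size=4, rng=rng)]
print(batch_sizes)   # [4, 4, 2]
```

Setting batch_size=1 corresponds to online learning, whereas setting it to the number
of training samples recovers regular gradient descent.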
Introducing the multilayer neural network architecture


In this section, you will learn how to connect multiple single neurons to a multilayer
feedforward neural network; this special type of fully connected network is also
called Multilayer Perceptron (MLP). The following figure illustrates the concept
of an MLP consisting of three layers:


[Figure: an MLP with an input layer (in), a hidden layer (h), and an output layer (out)]





The MLP depicted in the preceding figure has one input layer, one hidden layer, and 
one output layer. The units in the hidden layer are fully connected to the input layer, 
and the output layer is fully connected to the hidden layer. If such a network has 
more than one hidden layer, we also call it a deep artificial neural network. 


Note 


We can add an arbitrary number of hidden layers to the MLP to create deeper 
network architectures. Practically, we can think of the number of layers and units in 
a neural network as additional hyperparameters that we want to optimize for a given 
problem task using cross-validation techniques that we discussed in Chapter 6,
Learning Best Practices for Model Evaluation and Hyperparameter Tuning. 


However, the error gradients that we will calculate later via backpropagation will 
become increasingly small as more layers are added to a network. This vanishing 
gradient problem makes the model learning more challenging. Therefore, special 
algorithms have been developed to help train such deep neural network structures; 
this is known as deep learning. 


As shown in the preceding figure, we denote the ith activation unit in the lth layer as
$a_i^{(l)}$. To make the math and code implementations a bit more intuitive, we will not
use numerical indices to refer to layers, but we will use the in superscript for the
input layer, the h superscript for the hidden layer, and the out superscript for the
output layer. For instance, $a_i^{(in)}$ refers to the ith value in the input layer, $a_i^{(h)}$
refers to the ith unit in the hidden layer, and $a_i^{(out)}$ refers to the ith unit in the
output layer. Here, the activation units $a_0^{(in)}$ and $a_0^{(h)}$ are the bias units, which
we set equal to 1. The activation of the units in the input layer is just its input plus
the bias unit:


$$a^{(in)} = \begin{bmatrix} a_0^{(in)} \\ a_1^{(in)} \\ \vdots \\ a_m^{(in)} \end{bmatrix} = \begin{bmatrix} 1 \\ x_1^{(in)} \\ \vdots \\ x_m^{(in)} \end{bmatrix}$$


Note


Later in this chapter, we will implement the multilayer perceptron using separate
vectors for the bias unit, which makes the code implementation more efficient and
easier to read. This concept is also used by TensorFlow, a deep learning library that
we will introduce in Chapter 13, Parallelizing Neural Network Training with
TensorFlow. However, the mathematical equations that follow would appear
more complex or convoluted if we had to work with additional variables for the bias.
Note that the computation via appending 1s to the input vector (as shown
previously) and using a weight variable as bias is exactly the same as operating with
separate bias vectors; it is merely a different convention.
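

The equivalence of the two conventions can be checked numerically. In the following
sketch, the array names and the random values are assumptions chosen only for
illustration; the point is that both ways of handling the bias produce the same net
input:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(5, 3))        # 5 samples, 3 features
W = rng.normal(size=(3, 4))        # weights for 4 hidden units
b = rng.normal(size=4)             # separate bias vector

# Convention 1: separate bias vector
Z_separate = X @ W + b

# Convention 2: prepend a column of 1s to X and the bias as an extra row of W
X_aug = np.hstack([np.ones((5, 1)), X])
W_aug = np.vstack([b, W])
Z_augmented = X_aug @ W_aug

print(np.allclose(Z_separate, Z_augmented))   # True
```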


Each unit in layer l is connected to all units in layer l + 1 via a weight coefficient.
For example, the connection between the kth unit in layer l to the jth unit in layer
l + 1 will be written as $w_{k,j}^{(l)}$. Referring back to the previous figure, we denote the
weight matrix that connects the input to the hidden layer as $W^{(h)}$, and we write the
matrix that connects the hidden layer to the output layer as $W^{(out)}$.


While one unit in the output layer would suffice for a binary classification task, we
saw a more general form of a neural network in the preceding figure, which allows
us to perform multiclass classification via a generalization of the One-versus-All
(OvA) technique. To better understand how this works, remember the one-hot
representation of categorical variables that we introduced in Chapter 4, Building
Good Training Sets – Data Preprocessing. For example, we can encode the three
class labels in the familiar Iris dataset (0=Setosa, 1=Versicolor, 2=Virginica) as
follows:


$$0 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad 1 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad 2 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$





This one-hot vector representation allows us to tackle classification tasks with an 
arbitrary number of unique class labels present in the training set. 
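

A one-hot encoding of integer class labels takes only a few lines of NumPy. The
helper name onehot below is an assumption for illustration; note that this sketch
stores one encoded label per row, whereas the vectors above are written as columns:

```python
import numpy as np

def onehot(y, n_classes):
    """Encode integer class labels as one-hot row vectors."""
    encoded = np.zeros((y.shape[0], n_classes))
    encoded[np.arange(y.shape[0]), y] = 1.0
    return encoded

y = np.array([0, 1, 2, 1])          # 0=Setosa, 1=Versicolor, 2=Virginica
Y_onehot = onehot(y, n_classes=3)
print(Y_onehot)
```

The original integer labels can be recovered with np.argmax(Y_onehot, axis=1).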


If you are new to neural network representations, the indexing notation (subscripts
and superscripts) may look a little bit confusing at first. What may seem overly
complicated at first will make much more sense in later sections when we vectorize
the neural network representation. As introduced earlier, we summarize the weights
that connect the input and hidden layers by a matrix $W^{(h)} \in \mathbb{R}^{m \times d}$, where d is the
number of hidden units and m is the number of input units including the bias unit.
Since it is important to internalize this notation to follow the concepts later in this
chapter, let's summarize what we have just learned in a descriptive illustration of a
simplified 3-4-3 multilayer perceptron:
[Figure: a 3-4-3 multilayer perceptron. Input layer with 3 input units plus bias unit
(m = 3+1); hidden layer with 4 hidden units plus bias unit (d = 4+1); output layer
with 3 output units (t = 3); number of layers: L = 3. The annotated weight
$w_{1,3}^{(out)}$ connects the 1st non-bias neuron in the 2nd layer (hidden layer h) to the
3rd unit in the 3rd layer (output layer out).]
Activating a neural network via forward
propagation 


In this section, we will describe the process of forward propagation to calculate the 
output of an MLP model. To understand how it fits into the context of learning an 
MLP model, let's summarize the MLP learning procedure in three simple steps: 


1. Starting at the input layer, we forward propagate the patterns of the training data 
through the network to generate an output. 

2. Based on the network's output, we calculate the error that we want to minimize 
using a cost function that we will describe later. 

3. We backpropagate the error, find its derivative with respect to each weight in 
the network, and update the model. 


Finally, after we repeat these three steps for multiple epochs and learn the weights of 
the MLP, we use forward propagation to calculate the network output and apply a 
threshold function to obtain the predicted class labels in the one-hot representation, 
which we described in the previous section. 


Now, let's walk through the individual steps of forward propagation to generate an
output from the patterns in the training data. Since each unit in the hidden layer is
connected to all units in the input layer, we first calculate the activation unit of the
hidden layer $a_1^{(h)}$ as follows:


$$z_1^{(h)} = a_0^{(in)} w_{0,1}^{(h)} + a_1^{(in)} w_{1,1}^{(h)} + \dots + a_m^{(in)} w_{m,1}^{(h)}$$

$$a_1^{(h)} = \phi\left(z_1^{(h)}\right)$$


Here, $z_1^{(h)}$ is the net input and $\phi(\cdot)$ is the activation function, which has to be
differentiable to learn the weights that connect the neurons using a gradient-based
approach. To be able to solve complex problems such as image classification, we
need non-linear activation functions in our MLP model, for example, the sigmoid
(logistic) activation function that we remember from the section about logistic
regression in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn:


$$\phi(z) = \frac{1}{1 + e^{-z}}$$


As we can remember, the sigmoid function is an S-shaped curve that maps the net
input z onto a logistic distribution in the range 0 to 1, which intersects the y-axis
at $\phi(0) = 0.5$, as shown in the following graph:





MLP is a typical example of a feedforward artificial neural network. The term
feedforward refers to the fact that each layer serves as the input to the next layer
without loops, in contrast to recurrent neural networks, an architecture that we will
mention later in this chapter and discuss in more detail in Chapter 16, Modeling
Sequential Data Using Recurrent Neural Networks. The term multilayer perceptron
may sound a little bit confusing since the artificial neurons in this network
architecture are typically sigmoid units, not perceptrons. Intuitively, we can think of
the neurons in the MLP as logistic regression units that return values in the
continuous range between 0 and 1.
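

As a small illustration, the sigmoid function can be written in NumPy as follows.
Clipping the net input before calling np.exp is a design choice assumed here to avoid
overflow warnings for very large negative inputs; the bounds of -250 and 250 are
arbitrary:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid; z is clipped to avoid overflow in np.exp."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

z = np.array([-5.0, 0.0, 5.0])
a = sigmoid(z)
print(a)   # values strictly between 0 and 1, with sigmoid(0) = 0.5
```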


For purposes of code efficiency and readability, we will now write the activation in a
more compact form using the concepts of basic linear algebra, which will allow us to
vectorize our code implementation via NumPy rather than writing multiple nested
and computationally expensive Python for loops:


$$z^{(h)} = a^{(in)} W^{(h)}$$

$$a^{(h)} = \phi\left(z^{(h)}\right)$$


Here, $a^{(in)}$ is our 1 x m dimensional feature vector of a sample $x^{(in)}$ plus a bias unit.
$W^{(h)}$ is an m x d dimensional weight matrix where d is the number of units in the
hidden layer. After matrix-vector multiplication, we obtain the 1 x d dimensional net
input vector $z^{(h)}$ to calculate the activation $a^{(h)}$ (where $a^{(h)} \in \mathbb{R}^{1 \times d}$). Furthermore,
we can generalize this computation to all n samples in the training set:


$$Z^{(h)} = A^{(in)} W^{(h)}$$


Here, $A^{(in)}$ is now an n x m matrix, and the matrix-matrix multiplication will result
in an n x d dimensional net input matrix $Z^{(h)}$. Finally, we apply the activation
function $\phi(\cdot)$ to each value in the net input matrix to get the n x d activation matrix
$A^{(h)}$ for the next layer (here, the output layer):


$$A^{(h)} = \phi\left(Z^{(h)}\right)$$


Similarly, we can write the activation of the output layer in vectorized form for
multiple samples:


$$Z^{(out)} = A^{(h)} W^{(out)}$$


Here, we multiply the d x t matrix $W^{(out)}$ (t is the number of output units) by the n x
d dimensional matrix $A^{(h)}$ to obtain the n x t dimensional matrix $Z^{(out)}$ (the
rows in this matrix represent the outputs for each sample).


Lastly, we apply the sigmoid activation function to obtain the continuous valued
output of our network:


$$A^{(out)} = \phi\left(Z^{(out)}\right), \quad A^{(out)} \in \mathbb{R}^{n \times t}$$
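

The vectorized forward pass described in this section can be sketched in NumPy as
follows. This is a simplified illustration rather than the implementation developed
later in this chapter: the function name forward, the random weights, and the layer
sizes are assumptions, and the bias units are kept in separate vectors b_h and b_out
(so m and d count only the non-bias units here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

def forward(A_in, W_h, b_h, W_out, b_out):
    """Forward propagation for a batch of n samples."""
    Z_h = A_in @ W_h + b_h          # [n, m] dot [m, d] -> [n, d]
    A_h = sigmoid(Z_h)              # hidden layer activations
    Z_out = A_h @ W_out + b_out     # [n, d] dot [d, t] -> [n, t]
    A_out = sigmoid(Z_out)          # continuous valued network output
    return Z_h, A_h, Z_out, A_out

n, m, d, t = 6, 3, 4, 3             # a 3-4-3 network applied to 6 samples
rng = np.random.RandomState(1)
A_in = rng.normal(size=(n, m))
W_h = rng.normal(scale=0.1, size=(m, d))
W_out = rng.normal(scale=0.1, size=(d, t))
b_h, b_out = np.zeros(d), np.zeros(t)

Z_h, A_h, Z_out, A_out = forward(A_in, W_h, b_h, W_out, b_out)
y_pred = np.argmax(Z_out, axis=1)   # one predicted class label per row (sample)
print(A_out.shape, y_pred.shape)    # (6, 3) (6,)
```

Since the sigmoid function is monotonic, taking the argmax of the net input Z_out
yields the same class labels as taking the argmax of the activations A_out.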
Classifying handwritten digits


In the previous section, we covered a lot of the theory around neural networks, which 
can be a little bit overwhelming if you are new to this topic. Before we continue with 
the discussion of the algorithm for learning the weights of the MLP model, 
backpropagation, let's take a short break from the theory and see a neural network in 
action. 


Note 


The neural network theory can be quite complex, thus I want to recommend two 
additional resources, which cover some of the concepts that we discuss in this 
chapter in more detail: 


• Chapter 6, Deep Feedforward Networks, Deep Learning, I. Goodfellow, Y.
  Bengio, and A. Courville, MIT Press, 2016. (Manuscripts freely accessible at
  http://www.deeplearningbook.org.)

• Pattern Recognition and Machine Learning, C. M. Bishop and others, Volume
  1. Springer New York, 2006.


In this section, we will implement and train our first multilayer neural network to
classify handwritten digits from the popular Modified National Institute of Standards
and Technology (MNIST) dataset that has been constructed by Yann LeCun and
others, and serves as a popular benchmark dataset for machine learning algorithms
(Gradient-Based Learning Applied to Document Recognition, Y. LeCun, L. Bottou,
Y. Bengio, and P. Haffner, Proceedings of the IEEE, 86(11): 2278-2324, November
1998).




Obtaining the MNIST dataset 


The MNIST dataset is publicly available at http://yann.lecun.com/exdb/mnist/ and 
consists of the following four parts: 


• Training set images: train-images-idx3-ubyte.gz (9.9 MB, 47 MB unzipped, and
  60,000 samples)

• Training set labels: train-labels-idx1-ubyte.gz (29 KB, 60 KB unzipped, and
  60,000 labels)

• Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.8 MB unzipped, and
  10,000 samples)

• Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, and
  10,000 labels)


The MNIST dataset was constructed from two datasets of the US National Institute 
of Standards and Technology (NIST). The training set consists of handwritten 
digits from 250 different people, 50 percent high school students, and 50 percent 
employees from the Census Bureau. Note that the test set contains handwritten digits 
from different people following the same split. After downloading the files, I 
recommend that you unzip the files using the Unix/Linux gzip tool from the 
Terminal for efficiency, using the following command in your local MNIST 
download directory: 


gzip *ubyte.gz -d 
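

For readers who prefer to stay within Python, the same step can be sketched with
the gzip and shutil modules from the standard library. The function name gunzip_all
is hypothetical, and, unlike gzip -d, this sketch keeps the original .gz files next to
the extracted ones:

```python
import glob
import gzip
import os
import shutil

def gunzip_all(directory='.'):
    """Decompress every *ubyte.gz file in `directory`, keeping the .gz files."""
    extracted = []
    for gz_path in sorted(glob.glob(os.path.join(directory, '*ubyte.gz'))):
        out_path = gz_path[:-3]     # strip the '.gz' suffix
        with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(out_path)
    return extracted
```

Calling gunzip_all() without arguments in the MNIST download directory would return
the list of extracted file paths.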


Alternatively, you could use your favorite unzipping tool if you are working with a 
machine running on Microsoft Windows. The images are stored in byte format, and 
we will read them into NumPy arrays that we will use to train and test our MLP 
implementation. In order to do that, we will define the following helper function: 


import os
import struct
import numpy as np


def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte' % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte' % kind)

    # Label file: 8-byte header (magic number, item count), then one byte per label
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II',
                                 lbpath.read(8))
        labels = np.fromfile(lbpath,
                             dtype=np.uint8)

    # Image file: 16-byte header, then 28 x 28 = 784 bytes per image
    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack('>IIII',
                                               imgpath.read(16))
        images = np.fromfile(imgpath,
                             dtype=np.uint8).reshape(
                             len(labels), 784)
        # Rescale pixel values from [0, 255] to [-1, 1]
        images = ((images / 255.) - .5) * 2

    return images, labels


The load_mnist function returns two arrays, the first being an n x m dimensional
NumPy array (images), where n is the number of samples and m is the number of
features (here, pixels). The training dataset consists of 60,000 training digits and the
test set contains 10,000 samples. The images in the MNIST dataset
consist of 28 x 28 pixels, and each pixel is represented by a gray scale intensity
value. Here, we unroll the 28 x 28 pixels into one-dimensional row vectors, which
represent the rows in our images array (784 per row or image). The second array
(labels) returned by the load_mnist function contains the corresponding target
variable, the class labels (integers 0-9) of the handwritten digits.


The way we read in the image might seem a little bit strange at first:


>>> magic, n = struct.unpack('>II', lbpath.read(8))
>>> labels = np.fromfile(lbpath, dtype=np.uint8)


To understand how those two lines of code work, let's take a look at the dataset 
description from the MNIST website: 


[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label

Using the two preceding lines of code, we first read in the magic number, which is a
description of the file protocol, as well as the number of items (n) from the file buffer
before we read the following bytes into a NumPy array using the fromfile method.
The fmt parameter value '>II' that we passed as an argument to struct.unpack can
be decomposed into the following two parts:


• >: This is big-endian; it defines the order in which a sequence of bytes is
  stored. If you are unfamiliar with the terms big-endian and little-endian, you can
  find an excellent article about Endianness on Wikipedia:
  https://en.wikipedia.org/wiki/Endianness

• I: This is an unsigned integer
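

To see the effect of the format string without the MNIST files at hand, the 8-byte
header of the training label file can be rebuilt in memory. The values 2049 and 60000
are taken from the file description above; the variable names are assumptions made
for this sketch:

```python
import struct

# Build an 8-byte header like the one at the start of the training label file
header = struct.pack('>II', 2049, 60000)
magic, n = struct.unpack('>II', header)
print(magic, n)                      # 2049 60000

# Interpreting the same bytes as little-endian yields a different magic number
wrong_magic, _ = struct.unpack('<II', header)
print(wrong_magic)                   # 17301504
```

Each I consumes four bytes, which is why the label header is read with
lbpath.read(8) and the image header, which holds four integers, with imgpath.read(16).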


Finally, we also normalized the pixel values in MNIST to the range -1 to 1
(originally 0 to 255) via the following code line: 


images = ((images / 255.) - .5) * 2 


The reason behind this is that gradient-based optimization is much more stable under
these conditions as discussed in Chapter 2, Training Simple Machine Learning
Algorithms for Classification. Note that we scaled the images on a pixel-by-pixel
basis, which is different from the feature scaling approach that we took in previous
chapters. Previously, we derived scaling parameters from the training set and used
these to scale each column in the training set and test set. However, when working
with image pixels, centering them at zero and rescaling them to a [-1, 1] range is also
common and usually works well in practice.
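As a quick sanity check of this scaling, the following sketch applies the same expression to a small made-up array (the pixel values are hypothetical) and confirms that 0 maps to -1, 127.5 to 0, and 255 to 1:

```python
import numpy as np

# Hypothetical 2 x 3 "image" matrix with raw pixel intensities in [0, 255].
images = np.array([[0., 127.5, 255.],
                   [51., 204., 255.]])

# Same transformation as in the text: map [0, 255] to [-1, 1].
scaled = ((images / 255.) - .5) * 2

print(scaled.min(), scaled.max())
```

Because the constants 255 and 0.5 are fixed rather than estimated from the data, the same expression can be applied to training and test images alike.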


Note 


Another recently developed trick to improve convergence in gradient-based 
optimization through input scaling is batch normalization, which is an advanced 
topic that we will not cover in this book. However, if you are interested in deep 
learning applications and research, I highly recommend that you read more about 
batch normalization in the excellent research article Batch Normalization: 
Accelerating Deep Network Training by Reducing Internal Covariate Shift by Sergey 
Ioffe and Christian Szegedy (2015, https://arxiv.org/abs/1502.03167).


By executing the following code, we will now load the 60,000 training instances as 
well as the 10,000 test samples from the local directory where we unzipped the 
MNIST dataset (in the following code snippet, it is assumed that the downloaded
MNIST files were unzipped to the same directory in which this code was executed): 


>>> X_train, y_train = load_mnist('', kind='train')
>>> print('Rows: %d, columns: %d'
...       % (X_train.shape[0], X_train.shape[1]))
Rows: 60000, columns: 784


>>> X_test, y_test = load_mnist('', kind='t10k')
>>> print('Rows: %d, columns: %d'
...       % (X_test.shape[0], X_test.shape[1]))
Rows: 10000, columns: 784


To get an idea of how those images in MNIST look, let's visualize examples of the 
digits 0-9 after reshaping the 784-pixel vectors from our feature matrix into the 
original 28 x 28 image that we can plot via Matplotlib's imshow function: 


>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(nrows=2, ncols=5,
...                        sharex=True, sharey=True)
>>> ax = ax.flatten()
>>> for i in range(10):
...     img = X_train[y_train == i][0].reshape(28, 28)
...     ax[i].imshow(img, cmap='Greys')
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()


We should now see a plot of the 2 x 5 subfigures showing a representative image of
each unique digit:


In addition, let's also plot multiple examples of the same digit to see how different
the handwriting really is:


>>> fig, ax = plt.subplots(nrows=5,
...                        ncols=5,
...                        sharex=True,
...                        sharey=True)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = X_train[y_train == 7][i].reshape(28, 28)
...     ax[i].imshow(img, cmap='Greys')
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()


After executing the code, we should now see the first 25 variants of the digit 7:


After we've gone through all the previous steps, it is a good idea to save the scaled
images in a format that we can load more quickly into a new Python session to avoid
the overhead of reading in and processing the data again. When we are working with
NumPy arrays, an efficient and convenient method to save multidimensional
arrays to disk is NumPy's savez function (the official documentation can be found
here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html).


In short, the savez function is analogous to Python's pickle module that we used in
Chapter 9, Embedding a Machine Learning Model into a Web Application, but
optimized for storing NumPy arrays. The savez function produces zipped archives
of our data, producing .npz files that contain files in the .npy format; if you want to
learn more about this format, you can find a nice explanation, including a discussion
about advantages and disadvantages, in the NumPy documentation:
https://docs.scipy.org/doc/numpy/neps/npy-format.html. Further, instead of using
savez, we will use savez_compressed, which uses the same syntax as savez, but
further compresses the output file down to substantially smaller file sizes
(approximately 22 MB versus approximately 400 MB in this case). The following
code snippet will save both the training and test datasets to the archive file
'mnist_scaled.npz':


>>> import numpy as np
>>> np.savez_compressed('mnist_scaled.npz',
...                     X_train=X_train,
...                     y_train=y_train,
...                     X_test=X_test,
...                     y_test=y_test)


After we created the .npz file, we can load the preprocessed MNIST image arrays
using NumPy's load function as follows:


>>> mnist = np.load('mnist_scaled.npz')


The mnist variable now references an object that can access the four data arrays
that we provided as keyword arguments to the savez_compressed function; their
names are listed under the files attribute of the mnist object:


>>> mnist.files
['X_train', 'y_train', 'X_test', 'y_test']


For instance, to load the training data into our current Python session, we will access
the 'X_train' array as follows (similar to a Python dictionary):


>>> X_train = mnist['X_train']


Using a list comprehension, we can retrieve all four data arrays as follows:


>>> X_train, y_train, X_test, y_test = [mnist[f] for
...                                     f in mnist.files]
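The save-and-load round trip can be tried without the MNIST data. The sketch below uses two small made-up arrays and a temporary directory (the array names and the file name demo_scaled.npz are hypothetical) to show that the keyword names passed to np.savez_compressed become the keys of the loaded archive:

```python
import os
import tempfile

import numpy as np

# Small hypothetical arrays standing in for the MNIST data.
X_demo = np.arange(12, dtype=np.float64).reshape(3, 4)
y_demo = np.array([0, 1, 2])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo_scaled.npz')

    # The keyword names become the keys of the archive.
    np.savez_compressed(path, X_demo=X_demo, y_demo=y_demo)

    # np.load on an .npz file returns a dictionary-like object;
    # indexing it reads the corresponding array into memory.
    with np.load(path) as archive:
        keys = sorted(archive.files)
        X_loaded = archive['X_demo']
        y_loaded = archive['y_demo']

print(keys)
```

Using the archive as a context manager closes the underlying file once the arrays have been read.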


Note that while the preceding np.savez_compressed and np.load examples are not
essential for executing the code in this chapter, they serve as a demonstration of how to
save and load NumPy arrays conveniently and efficiently.


Implementing a multilayer perceptron


In this subsection, we will now implement the code of an MLP with one input, one 
hidden, and one output layer to classify the images in the MNIST dataset. I have
tried to keep the code as simple as possible. However, it may seem a little bit 
complicated at first, and I encourage you to download the sample code for this 
chapter from the Packt Publishing website or from GitHub 
(https://github.com/rasbt/python-machine-learning-book-2nd-edition) so that you can 
view this MLP implementation annotated with comments and syntax highlighting for 
better readability. 


If you are not running the code from the accompanying Jupyter Notebook file or 
don't have access to the internet, I recommend that you copy the NeuralNetMLP code 
from this chapter into a Python script file in your current working directory, for 
example, neuralnet.py, which you can then import into your current Python session 
via the following command: 


from neuralnet import NeuralNetMLP 


The code will contain parts that we have not talked about yet, such as the 
backpropagation algorithm, but most of the code should look familiar to you based 
on the Adaline implementation in Chapter 2, Training Simple Machine Learning
Algorithms for Classification, and the discussion of forward propagation in earlier 
sections. 


Do not worry if not all of the code makes immediate sense to you; we will follow up 
on certain parts later in this chapter. However, going over the code at this stage can 
make it easier to follow the theory later. 


The following is the implementation of a multilayer perceptron: 


import numpy as np
import sys


class NeuralNetMLP(object):
    """Feedforward neural network / Multi-layer perceptron classifier.

    Parameters
    ------------
    n_hidden : int (default: 30)
        Number of hidden units.
    l2 : float (default: 0.)
        Lambda value for L2-regularization.
        No regularization if l2=0. (default)
    epochs : int (default: 100)
        Number of passes over the training set.
    eta : float (default: 0.001)
        Learning rate.
    shuffle : bool (default: True)
        Shuffles training data every epoch
        if True to prevent circles.
    minibatch_size : int (default: 1)
        Number of training samples per minibatch.
    seed : int (default: None)
        Random seed for initializing weights and shuffling.

    Attributes
    -----------
    eval_ : dict
        Dictionary collecting the cost, training accuracy,
        and validation accuracy for each epoch during training.

    """
    def __init__(self, n_hidden=30,
                 l2=0., epochs=100, eta=0.001,
                 shuffle=True, minibatch_size=1, seed=None):

        self.random = np.random.RandomState(seed)
        self.n_hidden = n_hidden
        self.l2 = l2
        self.epochs = epochs
        self.eta = eta
        self.shuffle = shuffle
        self.minibatch_size = minibatch_size

    def _onehot(self, y, n_classes):
        """Encode labels into one-hot representation

        Parameters
        ------------
        y : array, shape = [n_samples]
            Target values.

        Returns
        -----------
        onehot : array, shape = (n_samples, n_labels)

        """
        onehot = np.zeros((n_classes, y.shape[0]))
        for idx, val in enumerate(y.astype(int)):
            onehot[val, idx] = 1.
        return onehot.T

    def _sigmoid(self, z):
        """Compute logistic function (sigmoid)"""
        return 1. / (1. + np.exp(-np.clip(z, -250, 250)))

    def _forward(self, X):
        """Compute forward propagation step"""

        # step 1: net input of hidden layer
        # [n_samples, n_features] dot [n_features, n_hidden]
        # -> [n_samples, n_hidden]
        z_h = np.dot(X, self.w_h) + self.b_h

        # step 2: activation of hidden layer
        a_h = self._sigmoid(z_h)

        # step 3: net input of output layer
        # [n_samples, n_hidden] dot [n_hidden, n_classlabels]
        # -> [n_samples, n_classlabels]
        z_out = np.dot(a_h, self.w_out) + self.b_out

        # step 4: activation output layer
        a_out = self._sigmoid(z_out)

        return z_h, a_h, z_out, a_out

    def _compute_cost(self, y_enc, output):
        """Compute cost function.

        Parameters
        ----------
        y_enc : array, shape = (n_samples, n_labels)
            one-hot encoded class labels.
        output : array, shape = [n_samples, n_output_units]
            Activation of the output layer (forward propagation)

        Returns
        ---------
        cost : float
            Regularized cost

        """
        L2_term = (self.l2 *
                   (np.sum(self.w_h ** 2.) +
                    np.sum(self.w_out ** 2.)))

        term1 = -y_enc * (np.log(output))
        term2 = (1. - y_enc) * np.log(1. - output)
        cost = np.sum(term1 - term2) + L2_term
        return cost

    def predict(self, X):
        """Predict class labels

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.

        Returns:
        ----------
        y_pred : array, shape = [n_samples]
            Predicted class labels.

        """
        z_h, a_h, z_out, a_out = self._forward(X)
        y_pred = np.argmax(z_out, axis=1)
        return y_pred

    def fit(self, X_train, y_train, X_valid, y_valid):
        """Learn weights from training data.

        Parameters
        -----------
        X_train : array, shape = [n_samples, n_features]
            Input layer with original features.
        y_train : array, shape = [n_samples]
            Target class labels.
        X_valid : array, shape = [n_samples, n_features]
            Sample features for validation during training
        y_valid : array, shape = [n_samples]
            Sample labels for validation during training

        Returns:
        ----------
        self

        """
        n_output = np.unique(y_train).shape[0]  # no. of class labels
        n_features = X_train.shape[1]

        ########################
        # Weight initialization
        ########################

        # weights for input -> hidden
        self.b_h = np.zeros(self.n_hidden)
        self.w_h = self.random.normal(loc=0.0, scale=0.1,
                                      size=(n_features,
                                            self.n_hidden))

        # weights for hidden -> output
        self.b_out = np.zeros(n_output)
        self.w_out = self.random.normal(loc=0.0, scale=0.1,
                                        size=(self.n_hidden,
                                              n_output))

        epoch_strlen = len(str(self.epochs))  # for progr. format.
        self.eval_ = {'cost': [], 'train_acc': [], 'valid_acc': []}

        y_train_enc = self._onehot(y_train, n_output)

        # iterate over training epochs
        for i in range(self.epochs):

            # iterate over minibatches
            indices = np.arange(X_train.shape[0])

            if self.shuffle:
                self.random.shuffle(indices)

            for start_idx in range(0, indices.shape[0] -
                                   self.minibatch_size +
                                   1, self.minibatch_size):
                batch_idx = indices[start_idx:start_idx +
                                    self.minibatch_size]

                # forward propagation
                z_h, a_h, z_out, a_out = \
                    self._forward(X_train[batch_idx])

                ##################
                # Backpropagation
                ##################

                # [n_samples, n_classlabels]
                sigma_out = a_out - y_train_enc[batch_idx]

                # [n_samples, n_hidden]
                sigmoid_derivative_h = a_h * (1. - a_h)

                # [n_samples, n_classlabels] dot [n_classlabels,
                #                                 n_hidden]
                # -> [n_samples, n_hidden]
                sigma_h = (np.dot(sigma_out, self.w_out.T) *
                           sigmoid_derivative_h)

                # [n_features, n_samples] dot [n_samples,
                #                              n_hidden]
                # -> [n_features, n_hidden]
                grad_w_h = np.dot(X_train[batch_idx].T, sigma_h)
                grad_b_h = np.sum(sigma_h, axis=0)

                # [n_hidden, n_samples] dot [n_samples,
                #                            n_classlabels]
                # -> [n_hidden, n_classlabels]
                grad_w_out = np.dot(a_h.T, sigma_out)
                grad_b_out = np.sum(sigma_out, axis=0)

                # Regularization and weight updates
                delta_w_h = (grad_w_h + self.l2*self.w_h)
                delta_b_h = grad_b_h  # bias is not regularized
                self.w_h -= self.eta * delta_w_h
                self.b_h -= self.eta * delta_b_h

                delta_w_out = (grad_w_out + self.l2*self.w_out)
                delta_b_out = grad_b_out  # bias is not regularized
                self.w_out -= self.eta * delta_w_out
                self.b_out -= self.eta * delta_b_out

            #############
            # Evaluation
            #############

            # Evaluation after each epoch during training
            z_h, a_h, z_out, a_out = self._forward(X_train)
            cost = self._compute_cost(y_enc=y_train_enc,
                                      output=a_out)

            y_train_pred = self.predict(X_train)
            y_valid_pred = self.predict(X_valid)

            train_acc = ((np.sum(y_train ==
                          y_train_pred)).astype(float) /
                         X_train.shape[0])
            valid_acc = ((np.sum(y_valid ==
                          y_valid_pred)).astype(float) /
                         X_valid.shape[0])

            sys.stderr.write('\r%0*d/%d | Cost: %.2f '
                             '| Train/Valid Acc.: %.2f%%/%.2f%% '
                             %
                             (epoch_strlen, i+1, self.epochs,
                              cost,
                              train_acc*100, valid_acc*100))
            sys.stderr.flush()

            self.eval_['cost'].append(cost)
            self.eval_['train_acc'].append(train_acc)
            self.eval_['valid_acc'].append(valid_acc)

        return self


Once you're done with executing this code, let's now initialize a new 784-100-10 
MLP, a neural network with 784 input units (n_features), 100 hidden units
(n_hidden), and 10 output units (n_output):


>>> nn = NeuralNetMLP(n_hidden=100,
...                   l2=0.01,
...                   epochs=200,
...                   eta=0.0005,
...                   minibatch_size=100,
...                   shuffle=True,
...                   seed=1)


If you read through the NeuralNetMLP code, you've probably already guessed what 
these parameters are for. Here, you find a short summary of these: 


• l2: This is the λ parameter for L2 regularization to decrease the degree of
  overfitting.
• epochs: This is the number of passes over the training set.
• eta: This is the learning rate η.
• shuffle: This is for shuffling the training set prior to every epoch to prevent
  the algorithm from getting stuck in circles.
• seed: This is a random seed for shuffling and weight initialization.
• minibatch_size: This is the number of training samples in each mini-batch
  when splitting the training data in each epoch for stochastic gradient descent.



The gradient is computed for each mini-batch separately instead of the entire 
training data for faster learning. 
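The way the fit method walks over the shuffled training indices can be isolated in a few lines. The following sketch reuses the same range arithmetic on toy sizes (10 samples and a mini-batch size of 4 are hypothetical values chosen for illustration):

```python
import numpy as np

rng = np.random.RandomState(1)
n_samples, minibatch_size = 10, 4  # hypothetical toy sizes

indices = np.arange(n_samples)
rng.shuffle(indices)

batches = []
# Same range arithmetic as in the fit method: only full mini-batches are
# used, so 2 of the 10 shuffled indices are skipped in this epoch.
for start_idx in range(0, indices.shape[0] - minibatch_size + 1,
                       minibatch_size):
    batch_idx = indices[start_idx:start_idx + minibatch_size]
    batches.append(batch_idx)

print(len(batches), [b.tolist() for b in batches])
```

Since the indices are reshuffled in every epoch, the samples left out of the last incomplete mini-batch differ from epoch to epoch.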


Next, we train the MLP using 55,000 samples from the already shuffled MNIST 
training dataset and use the remaining 5,000 samples for validation during training. 
Note that training the neural network may take up to 5 minutes on standard desktop 
computer hardware. 


As you may have noticed from the preceding code implementation, we implemented
the fit method so that it takes four input arguments: training images, training labels,
validation images, and validation labels. In neural network training, it is really useful
to compare training and validation accuracy during training, which helps us
judge whether the network model performs well, given the architecture and
hyperparameters.


In general, training (deep) neural networks is relatively expensive compared with the
other models we discussed so far. Thus, we want to stop it early in certain
circumstances and start over with different hyperparameter settings. Alternatively, if
we find that it increasingly tends to overfit the training data (noticeable by an
increasing gap between training and validation set performance), we may want to
stop the training early as well.
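One simple way to turn this idea into a rule is to stop when the validation accuracy has not improved for a fixed number of epochs. The helper below is a hypothetical sketch and not part of NeuralNetMLP; the function name, the patience parameter, and the accuracy values are all assumptions made for illustration:

```python
def should_stop_early(valid_acc_history, patience=5):
    """Return True if validation accuracy has not improved
    during the last `patience` epochs.

    Hypothetical helper for illustration; not part of NeuralNetMLP.
    """
    if len(valid_acc_history) <= patience:
        return False
    best_before = max(valid_acc_history[:-patience])
    recent_best = max(valid_acc_history[-patience:])
    return recent_best <= best_before


# Made-up validation accuracies per epoch.
improving = [0.90, 0.92, 0.94, 0.95, 0.96, 0.97]
stalled = [0.90, 0.95, 0.97, 0.97, 0.96, 0.96, 0.95, 0.96]

print(should_stop_early(improving, patience=3))
print(should_stop_early(stalled, patience=3))
```

In practice, such a check would be evaluated once per epoch on a list like the validation accuracies that the training loop collects.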


Now, to start the training, we execute the following code: 


>>> nn.fit(X_train=X_train[:55000],
...        y_train=y_train[:55000],
...        X_valid=X_train[55000:],
...        y_valid=y_train[55000:])
200/200 | Cost: 5065.78 | Train/Valid Acc.: 99.28%/97.98%


In our NeuralNetMLP implementation, we also defined an eval_ attribute that 
collects the cost, training, and validation accuracy for each epoch so that we can 
visualize the results using Matplotlib: 


>>> import matplotlib.pyplot as plt
>>> plt.plot(range(nn.epochs), nn.eval_['cost'])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs')
>>> plt.show()


The preceding code plots the cost over the 200 epochs, as shown in the following
graph:


As we can see, the cost decreased substantially during the first 100 epochs and seems
to slowly converge in the last 100 epochs. However, the small slope between epoch
175 and epoch 200 indicates that the cost would decrease further if we trained for
additional epochs.


Next, let's take a look at the training and validation accuracy: 


>>> plt.plot(range(nn.epochs), nn.eval_['train_acc'],
...          label='training')
>>> plt.plot(range(nn.epochs), nn.eval_['valid_acc'],
...          label='validation', linestyle='--')
>>> plt.ylabel('Accuracy')
>>> plt.xlabel('Epochs')
>>> plt.legend()
>>> plt.show()


The preceding code examples plot those accuracy values over the 200 training
epochs, as shown in the following figure:


[Figure: training and validation accuracy over the 200 epochs; legend: training (solid line), validation (dashed line)]


The plot reveals that the gap between training and validation accuracy increases the 
more epochs we train the network. At approximately the 50th epoch, the training and 
validation accuracy values are equal, and then, the network starts overfitting the 
training data. 


Note that this example was chosen deliberately to illustrate the effect of overfitting 
and demonstrate why it is useful to compare the validation and training accuracy 
values during training. One way to decrease the effect of overfitting is to increase the 
regularization strength—for example, by setting 12=0.1. Another useful technique to 
tackle overfitting in neural networks, dropout, will be covered in Chapter 15, 
Classifying Images with Deep Convolutional Neural Networks. 


Finally, let's evaluate the generalization performance of the model by calculating the
prediction accuracy on the test set:


>>> y_test_pred = nn.predict(X_test)
>>> acc = (np.sum(y_test == y_test_pred)
...        .astype(float) / X_test.shape[0])
>>> print('Test accuracy: %.2f%%' % (acc * 100))
Test accuracy: 97.54%


Despite the slight overfitting on the training data, our relatively simple one-hidden 
layer neural network achieved a relatively good performance on the test dataset, 
similar to the validation set accuracy (97.98 percent). 


To further fine-tune the model, we could change the number of hidden units, values 
of the regularization parameters, and the learning rate or use various other tricks that 
have been developed over the years but are beyond the scope of this book. In 
Chapter 14, Going Deeper — The Mechanics of TensorFlow, you will learn about a 
different neural network architecture that is known for its good performance on 
image datasets. Also, the chapter will introduce additional performance-enhancing 
tricks such as adaptive learning rates, momentum learning, and dropout. 


Lastly, let's take a look at some of the images that our MLP struggles with: 


>>> miscl_img = X_test[y_test != y_test_pred][:25]
>>> correct_lab = y_test[y_test != y_test_pred][:25]
>>> miscl_lab = y_test_pred[y_test != y_test_pred][:25]
>>> fig, ax = plt.subplots(nrows=5,
...                        ncols=5,
...                        sharex=True,
...                        sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = miscl_img[i].reshape(28, 28)
...     ax[i].imshow(img,
...                  cmap='Greys',
...                  interpolation='nearest')
...     ax[i].set_title('%d) t: %d p: %d'
...                     % (i+1, correct_lab[i], miscl_lab[i]))
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()


We should now see a 5 x 5 subplot matrix where the first number in the subtitles
indicates the plot index, the second number represents the true class label (t), and the
third number stands for the predicted class label (p):


As we can see in the preceding figure, some of those images are even challenging for
us humans to classify correctly. For example, the 6 in subplot 8 really looks like a
carelessly drawn 0, and the 8 in subplot 23 could be a 9 due to the narrow lower part
combined with the bold line.


Training an artificial neural network


Now that we have seen a neural network in action and have gained a basic
understanding of how it works by looking over the code, let's dig a little bit deeper
into some of the concepts, such as the logistic cost function and the backpropagation
algorithm that we implemented to learn the weights.


Computing the logistic cost function


The logistic cost function that we implemented as the _compute_cost method is
actually pretty simple to follow since it is the same cost function that we described in
the logistic regression section in Chapter 3, A Tour of Machine Learning Classifiers
Using scikit-learn:


$$J(w) = -\sum_{i=1}^{n} \left[ y^{[i]} \log\left(a^{[i]}\right) + \left(1 - y^{[i]}\right) \log\left(1 - a^{[i]}\right) \right]$$


Here, $a^{[i]}$ is the sigmoid activation of the ith sample in the dataset, which we
compute in the forward propagation step:


$$a^{[i]} = \phi\left(z^{[i]}\right)$$


Again, note that in this context, the superscript [i] is an index for training samples,
not layers.
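For a small numerical illustration of this cost, the sketch below evaluates it for three made-up samples. The net inputs z and labels y are hypothetical values, and the standalone sigmoid function is defined here only for the example:

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, phi(z) = 1 / (1 + exp(-z))."""
    return 1. / (1. + np.exp(-z))


# Hypothetical net inputs and binary class labels for n = 3 samples.
z = np.array([2.0, -1.0, 0.0])
y = np.array([1., 0., 1.])

a = sigmoid(z)  # a^[i] = phi(z^[i])

# Logistic cost summed over all samples.
cost = -np.sum(y * np.log(a) + (1. - y) * np.log(1. - a))

print(cost)
```

Each sample contributes only one of the two log terms, depending on whether its label is 1 or 0, and the contribution grows as the activation moves away from the label.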


Now, let's add a regularization term, which allows us to reduce the degree of
overfitting. As you recall from earlier chapters, the L2 regularization term is defined
as follows (remember that we don't regularize the bias units):


$$L2 = \lambda \lVert w \rVert_2^2 = \lambda \sum_{j=1}^{m} w_j^2$$


By adding the L2 regularization term to our logistic cost function, we obtain the
following equation:


$$J(w) = -\sum_{i=1}^{n} \left[ y^{[i]} \log\left(a^{[i]}\right) + \left(1 - y^{[i]}\right) \log\left(1 - a^{[i]}\right) \right] + \frac{\lambda}{2} \lVert w \rVert_2^2$$


Since we implemented an MLP for multiclass classification, it returns an output
vector of t elements that we need to compare to the t x 1 dimensional target vector in
the one-hot encoding representation. For example, the activation of the third layer
and the target class (here, class 2) for a particular sample may look like this:


$$a^{(out)} = \begin{bmatrix} 0.1 \\ 0.9 \\ \vdots \\ 0.3 \end{bmatrix}, \quad y = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}$$


Thus, we need to generalize the logistic cost function to all t activation units in our
network. The cost function (without the regularization term) becomes the
following:


$$J(W) = -\sum_{i=1}^{n} \sum_{j=1}^{t} \left[ y_j^{[i]} \log\left(a_j^{[i]}\right) + \left(1 - y_j^{[i]}\right) \log\left(1 - a_j^{[i]}\right) \right]$$


Here, again, the superscript [i] is the index of a particular sample in our training set.
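The generalized cost can be computed directly with NumPy once the labels are one-hot encoded. In the following sketch, the labels and output activations are made-up values for three samples and three classes (with zero-based class indices), and the vectorized one-hot step is one possible alternative to the loop in the _onehot method:

```python
import numpy as np

# Hypothetical labels for n = 3 samples and t = 3 classes.
y = np.array([1, 0, 2])
n_classes = 3

# One-hot encode: row i has a 1 in column y[i].
y_enc = np.zeros((y.shape[0], n_classes))
y_enc[np.arange(y.shape[0]), y] = 1.

# Made-up output-layer activations a_j^[i], shape [n_samples, n_classes].
a_out = np.array([[0.1, 0.8, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.2, 0.2, 0.6]])

# Sum the logistic cost over all samples i and all output units j.
term1 = -y_enc * np.log(a_out)
term2 = (1. - y_enc) * np.log(1. - a_out)
cost = np.sum(term1 - term2)

print(y_enc)
print(cost)
```

The two terms mirror term1 and term2 in the _compute_cost method, only without the regularization term.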


The following generalized regularization term may look a little bit complicated at
first, but here we are just calculating the sum of all squared weights of a layer l
(without the bias term) that we added to the first column:


$$J(W) = -\sum_{i=1}^{n} \sum_{j=1}^{t} \left[ y_j^{[i]} \log\left(a_j^{[i]}\right) + \left(1 - y_j^{[i]}\right) \log\left(1 - a_j^{[i]}\right) \right] + \frac{\lambda}{2} \sum_{l=1}^{L-1} \sum_{i=1}^{u_l} \sum_{j=1}^{u_{l+1}} \left(w_{j,i}^{(l)}\right)^2$$


Here, $u_l$ refers to the number of units in a given layer l, and the following expression
represents the penalty term:


$$\frac{\lambda}{2} \sum_{l=1}^{L-1} \sum_{i=1}^{u_l} \sum_{j=1}^{u_{l+1}} \left(w_{j,i}^{(l)}\right)^2$$


Remember that our goal is to minimize the cost function $J(W)$; thus, we need to
calculate the partial derivative of $J(W)$ with respect to each weight for
every layer in the network:


$$\frac{\partial}{\partial w_{j,i}^{(l)}} J(W)$$

In the next section, we will talk about the backpropagation algorithm, which allows 
us to calculate those partial derivatives to minimize the cost function. 


Note that $W$ consists of multiple matrices. In a multilayer perceptron with one
hidden layer, we have the weight matrix $W^{(h)}$, which connects the input to the
hidden layer, and $W^{(out)}$, which connects the hidden layer to the output layer. An
intuitive visualization of the three-dimensional tensor $W$ is provided in the
following figure:


[Figure: the weight matrices $W^{(h)}$ and $W^{(out)}$; axis labels: Features (rows), Hidden units (rows)]


In this simplified figure, it may seem that both $W^{(h)}$ and $W^{(out)}$ have the same
number of rows and columns, which is typically not the case unless we initialize an
MLP with the same number of hidden units, output units, and input features.
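To see the different shapes concretely, the short sketch below creates the two weight matrices for the 784-100-10 network used earlier, following the initialization scheme of the fit method. The standalone variable names w_h and w_out are used here only for illustration:

```python
import numpy as np

rng = np.random.RandomState(1)
n_features, n_hidden, n_output = 784, 100, 10  # the 784-100-10 MLP

# Same initialization scheme as in the fit method of NeuralNetMLP.
w_h = rng.normal(loc=0.0, scale=0.1, size=(n_features, n_hidden))
w_out = rng.normal(loc=0.0, scale=0.1, size=(n_hidden, n_output))

# Total number of weight coefficients (bias units not counted).
n_weights = w_h.size + w_out.size

print(w_h.shape, w_out.shape, n_weights)
```

The two matrices share only one dimension, the number of hidden units, which is what allows the hidden activations to be multiplied by the output weights.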


If this sounds confusing, stay tuned for the next section, where we will discuss the
dimensionality of $W^{(h)}$ and $W^{(out)}$ in more detail in the context of the
backpropagation algorithm. Also, I want to encourage you to read through the code
of the NeuralNetMLP again, which I annotated with helpful comments about the
dimensionality with regard to the different matrices and vector transformations. You
can obtain the annotated code either from Packt or the book's GitHub repository at
https://github.com/rasbt/python-machine-learning-book-2nd-edition.


Developing your intuition for backpropagation


Although backpropagation was rediscovered and popularized more than 30 years ago
(Learning representations by back-propagating errors, D. E. Rumelhart, G. E.
Hinton, and R. J. Williams, Nature, 323: 6088, pages 533-536, 1986), it still remains
one of the most widely used algorithms to train artificial neural networks very
efficiently. If you are interested in additional references regarding the history of
backpropagation, Juergen Schmidhuber wrote a nice survey article, Who Invented
Backpropagation?, which you can find online at
http://people.idsia.ch/~juergen/who-invented-backpropagation.html.


In this section, I intend to provide a short and intuitive summary and the bigger
picture of how this fascinating algorithm works before we dive into more
mathematical details. In essence, we can think of backpropagation as a very
computationally efficient approach to compute the partial derivatives of a complex
cost function in multilayer neural networks. Here, our goal is to use those derivatives
to learn the weight coefficients for parameterizing such a multilayer artificial neural
network. The challenge in the parameterization of neural networks is that we are
typically dealing with a very large number of weight coefficients in a
high-dimensional feature space. In contrast to cost functions of single-layer neural
networks such as Adaline or logistic regression, which we have seen in previous
chapters, the error surface of a neural network cost function is not convex or smooth
with respect to the parameters. There are many bumps in this high-dimensional cost
surface (local minima) that we have to overcome in order to find the global
minimum of the cost function.


You may recall the concept of the chain rule from your introductory calculus classes.
The chain rule is an approach to compute the derivative of a complex, nested
function, such as f(g(x)), as follows:


$$\frac{d}{dx}\left[f\left(g(x)\right)\right] = \frac{df}{dg} \cdot \frac{dg}{dx}$$








Similarly, we can use the chain rule for an arbitrarily long function composition. For
example, let's assume that we have five different functions, f(x), g(x), h(x), u(x), and
v(x), and let F be the function composition: F(x) = f(g(h(u(v(x))))). Applying the
chain rule, we can compute the derivative of this function as follows:


$$\frac{dF}{dx} = \frac{d}{dx} F(x) = \frac{d}{dx} f\left(g\left(h\left(u\left(v(x)\right)\right)\right)\right) = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx}$$
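The chain rule is easy to verify numerically. The sketch below uses a shorter composition of three hypothetical functions (an exponential, a sine, and a square, chosen only for illustration) and compares the chain-rule derivative with a central finite-difference approximation:

```python
import math


# Hypothetical functions for illustration: F(x) = f(g(h(x))).
def h(x):
    return x ** 2


def g(u):
    return math.sin(u)


def f(v):
    return math.exp(v)


def F(x):
    return f(g(h(x)))


def dF_dx(x):
    # Chain rule: df/dg * dg/dh * dh/dx
    return math.exp(math.sin(x ** 2)) * math.cos(x ** 2) * 2 * x


x0 = 0.7
eps = 1e-6

# Central finite-difference approximation of the derivative at x0.
numeric = (F(x0 + eps) - F(x0 - eps)) / (2 * eps)

print(dF_dx(x0), numeric)
```

The two values agree up to the approximation error of the finite difference, which is the same kind of check that is commonly used to debug hand-written gradient code.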


In the context of computer algebra, a set of techniques has been developed to solve
such problems very efficiently, which is also known as automatic differentiation. If
you are interested in learning more about automatic differentiation in machine
learning applications, I recommend that you read A. G. Baydin and B. A.
Pearlmutter's article Automatic Differentiation of Algorithms for Machine Learning,
arXiv preprint arXiv:1404.7456, 2014, which is freely available on arXiv at
http://arxiv.org/pdf/1404.7456.pdf.


Automatic differentiation comes with two modes, the forward and reverse modes;
backpropagation is simply a special case of reverse mode automatic
differentiation. The key point is that applying the chain rule in the forward mode can
be quite expensive since we would have to multiply large matrices for each layer
(Jacobians) that we eventually multiply by a vector to obtain the output. The trick of
reverse mode is that we start from right to left: we multiply a matrix by a vector,
which yields another vector that is multiplied by the next matrix and so on.
Matrix-vector multiplication is computationally much cheaper than matrix-matrix
multiplication, which is why backpropagation is one of the most popular algorithms
used in neural network training.
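The cost difference between the two multiplication orders can be illustrated with plain NumPy. This is only a sketch of the ordering argument, not an implementation of automatic differentiation; the three random square matrices stand in for layer Jacobians, and their size (50 x 50) and the naive multiplication counts are assumptions made for the example:

```python
import numpy as np

rng = np.random.RandomState(0)
d = 50  # hypothetical layer width

# Three made-up square Jacobians and a vector.
M1, M2, M3 = (rng.normal(size=(d, d)) for _ in range(3))
v = rng.normal(size=d)

# Left to right: two matrix-matrix products, then one matrix-vector product.
left_to_right = ((M1 @ M2) @ M3) @ v
mults_left_to_right = 2 * d**3 + d**2

# Right to left: three matrix-vector products.
right_to_left = M1 @ (M2 @ (M3 @ v))
mults_right_to_left = 3 * d**2

print(np.allclose(left_to_right, right_to_left))
print(mults_left_to_right, mults_right_to_left)
```

Both orders give the same vector because matrix multiplication is associative, but the right-to-left order needs far fewer scalar multiplications, and the gap widens as the matrices grow.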


Note 


To fully understand backpropagation, we need to borrow certain concepts from 
differential calculus, which is beyond the scope of this book. However, I have 
written a review chapter of the most fundamental concepts, which you might find 
useful in this context. It discusses function derivatives, partial derivatives, gradients, 
and the Jacobian. I made this text freely accessible at 
https://sebastianraschka.com/pdf/books/dlb/appendix_d_calculus.pdf. If you are 
unfamiliar with calculus or need a brief refresher, consider reading this text as an
additional supporting resource before reading the next section.


Training neural networks via backpropagation


In this section, we will go through the math of backpropagation to understand how 
you can learn the weights in a neural network very efficiently. Depending on how 
comfortable you are with mathematical representations, the following equations may 
seem relatively complicated at first. 


In a previous section, we saw how to calculate the cost as the difference between the 
activation of the last layer and the target class label. Now, we will see how the 
backpropagation algorithm works to update the weights in our MLP model from a 
mathematical perspective, which we implemented in the # Backpropagation section 
inside the fit method. As we recall from the beginning of this chapter, we first need 
to apply forward propagation in order to obtain the activation of the output layer, 
which we formulated as follows: 


$$Z^{(h)} = A^{(in)} W^{(h)}$$ (net input of the hidden layer)


$$A^{(h)} = \phi\left(Z^{(h)}\right)$$ (activation of the hidden layer)


$$Z^{(out)} = A^{(h)} W^{(out)}$$ (net input of the output layer)


$$A^{(out)} = \phi\left(Z^{(out)}\right)$$ (activation of the output layer)
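These four equations translate almost literally into NumPy. The sketch below runs them on random toy data; the sizes (4 samples, 6 features, 5 hidden units, 3 classes) and the random inputs are hypothetical, and the bias units are left out to match the equations:

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid with clipping, as in NeuralNetMLP._sigmoid."""
    return 1. / (1. + np.exp(-np.clip(z, -250, 250)))


rng = np.random.RandomState(1)
n, m, h, t = 4, 6, 5, 3  # toy sizes: samples, features, hidden units, classes

A_in = rng.normal(size=(n, m))
W_h = rng.normal(scale=0.1, size=(m, h))
W_out = rng.normal(scale=0.1, size=(h, t))

Z_h = A_in @ W_h        # net input of the hidden layer,  [n, h]
A_h = sigmoid(Z_h)      # activation of the hidden layer, [n, h]
Z_out = A_h @ W_out     # net input of the output layer,  [n, t]
A_out = sigmoid(Z_out)  # activation of the output layer, [n, t]

print(Z_h.shape, A_h.shape, Z_out.shape, A_out.shape)
```

The _forward method in NeuralNetMLP performs the same steps and additionally adds the bias vectors to both net inputs.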


Concisely, we just forward-propagate the input features through the connection in
the network, as shown in the following illustration:


In backpropagation, we propagate the error from right to left. We start by calculating
the error vector of the output layer:


$$\delta^{(out)} = a^{(out)} - y$$


Here, y is the vector of the true class labels (the variable in the NeuralNetMLP code
that corresponds to $\delta^{(out)}$ is sigma_out).


Next, we calculate the error term of the hidden layer:


$$\delta^{(h)} = \delta^{(out)} \left(W^{(out)}\right)^T \odot \frac{\partial \phi\left(z^{(h)}\right)}{\partial z^{(h)}}$$


Here, $\frac{\partial \phi\left(z^{(h)}\right)}{\partial z^{(h)}}$ is simply the derivative of the sigmoid activation function, which
we computed as sigmoid_derivative_h = a_h * (1. - a_h) in the fit method of
the NeuralNetMLP:


$$\frac{\partial \phi(z)}{\partial z} = \left(a^{(h)} \odot \left(1 - a^{(h)}\right)\right)$$


Note that the $\odot$ symbol means element-wise multiplication in this context.


Note 


Although it is not important to follow the next equations, you may be curious how I 
obtained the derivative of the activation function; I have summarized the derivation 
step by step here: 


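$\phi'(z) = \frac{\partial}{\partial z} \left( \frac{1}{1 + e^{-z}} \right)$


$= \frac{e^{-z}}{(1 + e^{-z})^2}$


$= \frac{1 + e^{-z}}{(1 + e^{-z})^2} - \left( \frac{1}{1 + e^{-z}} \right)^2$


$= \frac{1}{1 + e^{-z}} - \left( \frac{1}{1 + e^{-z}} \right)^2$


$= \phi(z) - (\phi(z))^2$


$= \phi(z)(1 - \phi(z))$


$= a(1 - a)$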
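If you would like to convince yourself of this result numerically, the following
sketch compares a(1 - a) with a central finite-difference approximation of the
sigmoid derivative. The grid of z values and the step size eps are arbitrary choices
made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
a = sigmoid(z)

# Analytic derivative from the derivation: phi'(z) = a * (1 - a)
analytic = a * (1.0 - a)

# Central finite difference: (phi(z + eps) - phi(z - eps)) / (2 * eps)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The two results should agree up to a small numerical error.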


Next, we compute the $\delta^{(h)}$ layer error matrix (sigma_h) as follows:


$\delta^{(h)} = \delta^{(out)} (W^{(out)})^T \odot (a^{(h)} \odot (1 - a^{(h)}))$
To better understand how we computed this $\delta^{(h)}$ term, let's walk through it in
more detail. In the preceding equation, we used the transpose $(W^{(out)})^T$ of the
h × t-dimensional matrix $W^{(out)}$. Here, t is the number of output class labels and
h is the number of hidden units. The matrix multiplication between the
n × t-dimensional $\delta^{(out)}$ matrix and the t × h-dimensional matrix
$(W^{(out)})^T$ results in an n × h-dimensional matrix that we multiplied elementwise
by the sigmoid derivative of the same dimension to obtain the n × h-dimensional
matrix $\delta^{(h)}$.
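The dimensions in this walkthrough can be checked with a short NumPy sketch. The
sizes n, h, and t and the random arrays are hypothetical stand-ins; only the shapes
matter here. Multiplying an n × t matrix by a t × h matrix yields an n × h matrix,
which matches the shape of the hidden-layer activations:

```python
import numpy as np

rng = np.random.RandomState(0)
n, h, t = 6, 4, 3                       # hypothetical sizes
delta_out = rng.normal(size=(n, t))     # n x t output-layer error
W_out = rng.normal(size=(h, t))         # h x t output weights
a_h = 1.0 / (1.0 + np.exp(-rng.normal(size=(n, h))))  # n x h hidden activations

sigmoid_derivative_h = a_h * (1.0 - a_h)               # n x h
# (n x t) @ (t x h) -> (n x h), then element-wise product with (n x h)
delta_h = (delta_out @ W_out.T) * sigmoid_derivative_h
print(delta_h.shape)
```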


Eventually, after obtaining the $\delta$ terms, we can now write the derivative of the
cost function as follows:


$\frac{\partial}{\partial w_{i,j}^{(out)}} J(W) = a_j^{(h)} \delta_i^{(out)}$


$\frac{\partial}{\partial w_{i,j}^{(h)}} J(W) = a_j^{(in)} \delta_i^{(h)}$


Next, we need to accumulate the partial derivative of every node in each layer and
the error of the node in the next layer. However, remember that we need to compute
$\Delta_{i,j}^{(l)}$ for every sample in the training set. Thus, it is easier to
implement it as a vectorized version like in our NeuralNetMLP code implementation:


$\Delta^{(h)} = \Delta^{(h)} + (A^{(in)})^T \delta^{(h)}$


$\Delta^{(out)} = \Delta^{(out)} + (A^{(h)})^T \delta^{(out)}$
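To see why the vectorized form is equivalent to accumulating over the samples one
by one, consider the following sketch. The array sizes and random values are
assumptions for illustration; the loop adds one outer product per training sample,
and the matrix product computes the same sum in a single step:

```python
import numpy as np

rng = np.random.RandomState(0)
n, m, h = 5, 4, 3                     # hypothetical sizes
A_in = rng.normal(size=(n, m))        # inputs, one row per sample
delta_h = rng.normal(size=(n, h))     # hidden-layer error terms

# Per-sample accumulation: add the outer product of each sample's
# activation vector and error vector.
grad_loop = np.zeros((m, h))
for a_i, d_i in zip(A_in, delta_h):
    grad_loop += np.outer(a_i, d_i)

# Vectorized version: a single matrix product.
grad_vec = A_in.T @ delta_h
print(np.allclose(grad_loop, grad_vec))
```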


And after we have accumulated the partial derivatives, we can add the regularization
term:


$\Delta^{(l)} := \Delta^{(l)} + \lambda^{(l)} W^{(l)}$ (except for the bias term)


The two previous mathematical equations correspond to the code variables
delta_w_h, delta_b_h, delta_w_out, and delta_b_out in NeuralNetMLP.


Lastly, after we have computed the gradients, we can now update the weights by
taking a step in the opposite direction of the gradient for each layer l:


$W^{(l)} := W^{(l)} - \eta \Delta^{(l)}$


This is implemented as follows:


self.w_h -= self.eta * delta_w_h
self.b_h -= self.eta * delta_b_h
self.w_out -= self.eta * delta_w_out
self.b_out -= self.eta * delta_b_out
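The following standalone sketch illustrates one such update for the hidden layer.
It is not the NeuralNetMLP code: the values of eta and l2 and the random gradient
arrays are placeholders, and it assumes that the L2 penalty contributes l2 * w to the
weight gradient while the bias is left unregularized:

```python
import numpy as np

rng = np.random.RandomState(0)
eta, l2 = 0.01, 0.1                   # hypothetical learning rate and L2 strength
w_h = rng.normal(size=(4, 3))         # hidden-layer weights
b_h = np.zeros(3)                     # hidden-layer bias
grad_w_h = rng.normal(size=(4, 3))    # stand-in for the accumulated weight gradient
grad_b_h = rng.normal(size=3)         # stand-in for the accumulated bias gradient

w_before = w_h.copy()
delta_w_h = grad_w_h + l2 * w_h       # L2 term is added to the weights only
delta_b_h = grad_b_h                  # bias is not regularized
w_h -= eta * delta_w_h                # step against the gradient
b_h -= eta * delta_b_h
```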


To bring everything together, let's summarize backpropagation in the following 
figure: 
[Figure: summary of backpropagation. The error term of the output layer,
$\delta^{(out)} = a^{(out)} - y$, is obtained by comparing the output with the target y;
the error term of the hidden layer is
$\delta^{(h)} = \delta^{(out)} (W^{(out)})^T \odot \frac{\partial \phi(z^{(h)})}{\partial z^{(h)}}$;
the gradients are
$\frac{\partial}{\partial w_{i,j}^{(out)}} J(W) = a_j^{(h)} \delta_i^{(out)}$ and
$\frac{\partial}{\partial w_{i,j}^{(h)}} J(W) = a_j^{(in)} \delta_i^{(h)}$.]
About the convergence in neural networks


You might be wondering why we did not use regular gradient descent but instead
used mini-batch learning to train our neural network for the handwritten digit
classification. You may recall our discussion on stochastic gradient descent that we
used to implement online learning. In online learning, we compute the gradient based
on a single training example (k = 1) at a time to perform the weight update.
Although this is a stochastic approach, it often leads to very accurate solutions with a
much faster convergence than regular gradient descent. Mini-batch learning is a
special form of stochastic gradient descent where we compute the gradient based on
a subset k of the n training samples with 1 < k < n. Mini-batch learning has the
advantage over online learning that we can make use of our vectorized
implementations to improve computational efficiency. At the same time, we can
update the weights much more often than in regular gradient descent. Intuitively,
you can think of mini-batch learning as predicting the voter turnout of a presidential
election from a poll by asking only a representative subset of the population rather
than asking the entire population (which would be equal to running the actual
election).
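A minimal sketch of how a training set can be split into shuffled mini-batches is
shown below. The helper name iterate_minibatches and the toy arrays are hypothetical
and only serve to illustrate the idea of drawing subsets of size k from n samples:

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled mini-batches that together cover all n samples once."""
    indices = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.RandomState(1)
X = np.arange(20).reshape(10, 2)      # n = 10 toy samples
y = np.arange(10)
batch_sizes = [len(y_b) for _, y_b in iterate_minibatches(X, y, 4, rng)]
print(batch_sizes)
```

With n = 10 and a batch size of 4, one pass over the data yields two full batches
and one smaller remainder batch, so the weights would be updated three times per
epoch instead of once.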


Multilayer neural networks are much harder to train than simpler algorithms such as 
Adaline, logistic regression, or support vector machines. In multilayer neural 
networks, we typically have hundreds, thousands, or even billions of weights that we 
need to optimize. Unfortunately, the output function has a rough surface and the 
optimization algorithm can easily become trapped in local minima, as shown in the 
following figure: 
[Figure: cost J(W) plotted against a single weight, showing a random initial
condition, a local cost minimum, and the global cost minimum]

Note that this representation is extremely simplified since our neural network has
many dimensions, which makes it impossible to visualize the actual cost surface for
the human eye. Here, we only show the cost surface for a single weight on the x-axis.
However, the main message is that we do not want our algorithm to get trapped in 
local minima. By increasing the learning rate, we can more readily escape such local 
minima. On the other hand, we also increase the chance of overshooting the global 
optimum if the learning rate is too large. Since we initialize the weights randomly, 
we start with a solution to the optimization problem that is typically hopelessly 
wrong. 
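To illustrate how the starting point decides which minimum gradient descent ends up
in, here is a small sketch on a one-dimensional toy cost. The function
w^4 - 3w^2 + w, the learning rate, and the two starting values are made-up examples
and have nothing to do with the actual cost surface of our network:

```python
def cost(w):
    """Toy non-convex cost with one local and one global minimum."""
    return w**4 - 3.0 * w**2 + w

def gradient(w):
    return 4.0 * w**3 - 6.0 * w + 1.0

def gradient_descent(w, eta=0.01, n_steps=500):
    """Plain gradient descent on the toy cost, starting from w."""
    for _ in range(n_steps):
        w -= eta * gradient(w)
    return w

w_from_right = gradient_descent(2.0)    # ends in the local minimum
w_from_left = gradient_descent(-2.0)    # ends in the global minimum
print(round(w_from_right, 2), round(cost(w_from_right), 2))
print(round(w_from_left, 2), round(cost(w_from_left), 2))
```

Both runs converge, but only the run that starts on the left reaches the lower of the
two minima.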
A few last words about the neural network implementation


You may be wondering why we went through all of this theory just to implement a
simple multilayer artificial network that can classify handwritten digits instead of
using an open source Python machine learning library. In fact, we will introduce
more complex neural network models in the next chapters, which we will train using
the open source TensorFlow library (https://www.tensorflow.org). Although the
from-scratch implementation in this chapter seems a bit tedious at first, it was a good
exercise for understanding the basics behind backpropagation and neural network
training, and a basic understanding of algorithms is crucial for applying machine
learning techniques appropriately and successfully.


Now that you have learned how feedforward neural networks work, we are ready to
explore more sophisticated deep neural networks using libraries such as TensorFlow
and Keras (https://keras.io), which allow us to construct neural networks more
efficiently, as we will see in Chapter 13, Parallelizing Neural Network Training with
TensorFlow. Over the past two years, since its release in November 2015, TensorFlow
has gained a lot of popularity among machine learning researchers, who use it to
construct deep neural networks because of its ability to optimize mathematical
expressions for computations on multidimensional arrays utilizing Graphics
Processing Units (GPUs). While TensorFlow can be considered a low-level deep
learning library, simplifying APIs such as Keras have been developed that make the
construction of common deep learning models even more convenient.
Summary


In this chapter, you have learned the basic concepts behind multilayer artificial
neural networks, which are currently the hottest topic in machine learning research.
In Chapter 2, Training Simple Machine Learning Algorithms for Classification, we
started our journey with simple single-layer neural network structures and now we
have connected multiple neurons to a powerful neural network architecture to solve
complex problems such as handwritten digit recognition. We demystified the popular
backpropagation algorithm, which is one of the building blocks of many neural
network models that are used in deep learning. After learning about the
backpropagation algorithm in this chapter, we are well-equipped for exploring more
complex deep neural network architectures. In the remaining chapters, we will
introduce TensorFlow, an open source library geared towards deep learning, which
allows us to implement and train multilayer neural networks more efficiently.
Chapter 13. Parallelizing Neural Network Training with TensorFlow


In this chapter, we'll move on from the mathematical foundations of machine
learning and deep learning to introducing TensorFlow. TensorFlow is one of the
most popular deep learning libraries currently available, and it can let us implement
neural networks much more efficiently than any of our previous NumPy
implementations. In this chapter, we'll start using TensorFlow and see how it brings
significant benefits to training performance.


This chapter begins the next stage of our journey into training machine learning and
deep learning, and we'll explore the following topics:


• How TensorFlow improves training performance
• Working with TensorFlow to write optimized machine learning code
• Using TensorFlow high-level APIs to build a multilayer neural network
• Choosing activation functions for artificial neural networks
• Introducing Keras, a high-level wrapper around TensorFlow, for implementing
  common deep learning architectures most conveniently
TensorFlow and training performance


TensorFlow can speed up our machine learning tasks significantly. To understand 
how it can do this, let's begin by discussing some of the performance challenges we 
typically run into when we run expensive calculations on our hardware. 


The performance of computer processors has, of course, been improving 
continuously over recent years, and that's allowed us to train more powerful and 
complex learning systems, and so to improve the predictive performance of our 
machine learning models. Even the cheapest desktop computer hardware that's 
available right now comes with processing units that have multiple cores. 


Also, in the previous chapters, we saw that many functions in scikit-learn allowed us
to spread those computations over multiple processing units. However, by default,
Python is limited to execution on one core due to the Global Interpreter Lock
(GIL). So, although we can indeed take advantage of its multiprocessing library to
distribute our computations over multiple cores, we still have to consider that the
most advanced desktop hardware rarely comes with more than 8 or 16 such cores.


If we recall from Chapter 12, Implementing a Multilayer Artificial Neural Network
from Scratch, where we implemented a very simple multilayer perceptron with only
one hidden layer consisting of 100 units, we had to optimize approximately 80,000
weight parameters ([784 * 100 + 100] + [100 * 10] + 10 = 79,510) to learn a model
for a very simple image classification task. The images in MNIST are rather small
(28 x 28 pixels), and we can only imagine the explosion in the number of parameters
if we want to add additional hidden layers or work with images that have higher
pixel densities.
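The parameter count above can be reproduced with a few lines of Python. The helper
count_mlp_parameters is a hypothetical function written for this illustration, and
the second network with an additional 100-unit hidden layer is only an example of
how the count grows:

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus bias vector
    return total

n_params_chapter12 = count_mlp_parameters([784, 100, 10])
n_params_deeper = count_mlp_parameters([784, 100, 100, 10])  # hypothetical extra layer
print(n_params_chapter12, n_params_deeper)
```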


Such a task would quickly become unfeasible for a single processing unit. The 
question then becomes—how can we tackle such problems more effectively? 


The obvious solution to this problem is to use GPUs, which are real workhorses.
You can think of a graphics card as a small computer cluster inside your machine.
Another advantage is that modern GPUs are relatively cheap compared to the
state-of-the-art CPUs, as we can see in the following overview:


Specifications                 Intel® Core™ i7-6900K      NVIDIA GeForce®
                               Processor Extreme Ed.      GTX 1080 Ti

Base Clock Frequency           3.2 GHz                    < 1.5 GHz
Cores                          8                          3584
Memory Bandwidth               64 GB/s                    484 GB/s
Floating-Point Calculations    409 GFLOPS                 11300 GFLOPS
Cost                           ~ $1000.00                 ~ $700.00


The sources for the information in the table are the following websites:


• https://www.intel.com/content/www/us/en/products/processors/core/x-series/i7-6900k.html
• https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/


(Date: August 2017)


At 70 percent of the price of a modern CPU, we can get a GPU that has 450 times
more cores and is capable of around 15 times more floating-point calculations per
second. So, what is holding us back from utilizing GPUs for our machine learning
tasks?


The challenge is that writing code to target GPUs is not as simple as executing
Python code in our interpreter. There are special packages, such as CUDA and
OpenCL, that allow us to target the GPU. However, writing code in CUDA or
OpenCL is probably not the most convenient way to implement and run machine
learning algorithms. The good news is that this is what TensorFlow was developed
for!
What is TensorFlow?


TensorFlow is a scalable and multiplatform programming interface for implementing 
and running machine learning algorithms, including convenience wrappers for deep 
learning. 


TensorFlow was developed by the researchers and engineers of the Google Brain
team; and while the main development is led by a team of researchers and software
engineers at Google, its development also involves many contributions from the
open source community. TensorFlow was initially built for only internal use at
Google, but it was subsequently released in November 2015 under a permissive open
source license.


To improve the performance of training machine learning models, TensorFlow 
allows execution on both CPUs and GPUs. However, its greatest performance 
capabilities can be discovered when using GPUs. TensorFlow supports CUDA- 
enabled GPUs officially. Support for OpenCL-enabled devices is still experimental. 
However, OpenCL will likely be officially supported in the near future.


TensorFlow currently supports frontend interfaces for a number of programming 
languages. Lucky for us as Python users, TensorFlow's Python API is currently the 
most complete API, thereby attracting many machine learning and deep learning 
practitioners. Furthermore, TensorFlow has an official API in C++. 


The APIs in other languages, such as Java, Haskell, Node.js, and Go, are not stable 
yet, but the open source community and TensorFlow developers are constantly 
improving them. TensorFlow computations rely on constructing a directed graph for 
representing the data flow. Even though building the graph may sound complicated, 
TensorFlow comes with high-level APIs that make it very easy.
How we will learn TensorFlow


We'll learn first of all about the low-level TensorFlow API. While implementing 
models at this level can be a little bit cumbersome at first, the advantage of the low- 
level API is that it gives us more flexibility as programmers to combine the basic 
operations and develop complex machine learning models. Starting from 
TensorFlow version 1.1.0, high-level APIs are added on top of the low-level API (for 
instance, the so-called Layers and Estimators APIs), which allow building and 
prototyping models much faster. 


After learning about the low-level API, we will move forward to explore two
high-level APIs, namely TensorFlow Layers and Keras. However, let's begin by taking
our first steps with the TensorFlow low-level API, and ease ourselves into how
everything works.
First steps with TensorFlow


In this section, we'll take our first steps in using the low-level TensorFlow API. 
Depending on how your system is set up, you can typically just use Python's pip 
installer and install TensorFlow from PyPI by executing the following from your 
Terminal: 


pip install tensorflow 


In case you want to use GPUs, the CUDA Toolkit as well as the NVIDIA cuDNN 
library need to be installed; then you can install TensorFlow with GPU support, as 
follows: 


pip install tensorflow-gpu 


TensorFlow is under active development; therefore, every couple of months, newer
versions are released with significant changes. At the time of writing this chapter, the
latest TensorFlow version is 1.3.0. You can verify your TensorFlow version from
your Terminal, as follows:


python -c 'import tensorflow as tf; print(tf.__version__)'


Note 


If you experience problems with the installation procedure, I recommend that you
read more about the system- and platform-specific recommendations that are provided
at https://www.tensorflow.org/install/. Note that all the code in this chapter can be
run on your CPU; using a GPU is entirely optional but recommended if you want to
fully enjoy the benefits of TensorFlow. If you have a graphics card, refer to the
installation page to set it up appropriately. In addition, you may find this
TensorFlow-GPU setup guide helpful, which explains how to install the NVIDIA
graphics card drivers, CUDA, and cuDNN on Ubuntu (not required but
recommended requirements for running TensorFlow on a GPU):


https://sebastianraschka.com/pdf/books/dlb/appendix_h_cloud-computing.pdf.


TensorFlow is built around a computation graph composed of a set of nodes. Each 
node represents an operation that may have zero or more input or output. The values 
that flow through the edges of the computation graph are called tensors. 


Tensors can be understood as a generalization of scalars, vectors, matrices, and so
on. More concretely, a scalar can be defined as a rank-0 tensor, a vector as a rank-1
tensor, a matrix as a rank-2 tensor, and matrices stacked in a third dimension as
rank-3 tensors.
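NumPy arrays follow the same idea, so the notion of rank can be illustrated without
TensorFlow. In this sketch, the ndim attribute of a NumPy array plays the role of the
tensor rank; the concrete values are arbitrary:

```python
import numpy as np

scalar = np.array(3.0)                    # rank-0 tensor
vector = np.array([1.0, 2.0, 3.0])        # rank-1 tensor
matrix = np.ones((2, 3))                  # rank-2 tensor
stacked = np.stack([matrix, matrix])      # two matrices stacked: rank-3 tensor

ranks = [t.ndim for t in (scalar, vector, matrix, stacked)]
print(ranks, stacked.shape)
```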


Once a computation graph is built, the graph can be launched in a TensorFlow 
Session for executing different nodes of the graph. In Chapter 14, Going Deeper — 
The Mechanics of TensorFlow, we will cover the steps in building the computation 
graph and launching the graph in a session in more detail. 


As a warm-up exercise, we will start with the use of simple scalars from TensorFlow 
to compute a net input z of a sample point x in a one-dimensional dataset with weight 
w and bias b: 


$z = w \times x + b$


The following code shows the implementation of this equation in the low-level 
TensorFlow API: 


import tensorflow as tf


## create a graph
g = tf.Graph()
with g.as_default():
    x = tf.placeholder(dtype=tf.float32,
                       shape=(None), name='x')
    w = tf.Variable(2.0, name='weight')
    b = tf.Variable(0.7, name='bias')

    z = w*x + b

    init = tf.global_variables_initializer()

## create a session and pass in graph g
with tf.Session(graph=g) as sess:
    ## initialize w and b:
    sess.run(init)
    ## evaluate z:
    for t in [1.0, 0.6, -1.8]:
        print('x=%4.1f --> z=%4.1f'%(
              t, sess.run(z, feed_dict={x:t})))


After executing the previous code, you should see the following output: 
x= 1.0 --> z= 2.7
x= 0.6 --> z= 1.9
x=-1.8 --> z=-2.9


This was pretty straightforward, right? In general, when we develop a model in the 
TensorFlow low-level API, we need to define placeholders for input data (x, y, and 
sometimes other tunable parameters); then, define the weight matrices and build the 
model from input to output. If this is an optimization problem, we should define the 
loss or cost function and determine which optimization algorithm to use. TensorFlow 
will create a graph that contains all the symbols that we have defined as nodes in this 
graph. 


Here, we created a placeholder for x with shape=(None). This allows us to feed the 
values in an element-by-element form and as a batch of input data at once, as 
follows: 


>>> with tf.Session(graph=g) as sess:
...     sess.run(init)
...     print(sess.run(z, feed_dict={x:[1., 2., 3.]}))


[ 2.70000005  4.69999981  6.69999981]
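For comparison, the same net input can be sketched in plain NumPy, where
broadcasting handles both a single value and a batch of inputs. This is only an
analogy to the placeholder with shape=(None); the helper net_input is a hypothetical
function and not part of the TensorFlow API:

```python
import numpy as np

w = np.float32(2.0)
b = np.float32(0.7)

def net_input(x):
    """Compute z = w * x + b for a scalar or a batch of inputs."""
    return w * np.asarray(x, dtype=np.float32) + b

z_single = net_input(1.0)              # one sample
z_batch = net_input([1.0, 2.0, 3.0])   # a batch of three samples
print(z_single, z_batch)
```

Because the values are stored as 32-bit floats, the printed results may differ
slightly from 2.7, 4.7, and 6.7 in the last digits, just as in the TensorFlow output.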


Note 


Note that we are omitting Python's command-line prompt in several places in this 
chapter to improve the readability of long code examples by avoiding unnecessary 
text wrapping; this is because TensorFlow's function and method names can be very 
verbose. 


Also, note that the official TensorFlow style guide
(https://www.tensorflow.org/community/style_guide) recommends using
two-character spacing for code indents. However, we chose four characters for indents
as it is more consistent with the official Python style guide and also helps in
displaying the code syntax highlighting in many text editors correctly as well as the
accompanying Jupyter code notebooks at
https://github.com/rasbt/python-machine-learning-book-2nd-edition.
Working with array structures


Let's discuss how to use array structures in TensorFlow. By executing the following
code, we will create a simple rank-3 tensor of size batchsize × 2 × 3, reshape it, and
calculate the column sums using TensorFlow's optimized expressions. Since we do
not know the batch size a priori, we specify None for the batch size in the argument
for the shape parameter of the placeholder x:


import tensorflow as tf
import numpy as np


g = tf.Graph()
with g.as_default():
    x = tf.placeholder(dtype=tf.float32,
                       shape=(None, 2, 3),
                       name='input_x')

    x2 = tf.reshape(x, shape=(-1, 6),
                    name='x2')

    ## calculate the sum of each column
    xsum = tf.reduce_sum(x2, axis=0, name='col_sum')

    ## calculate the mean of each column
    xmean = tf.reduce_mean(x2, axis=0, name='col_mean')


with tf.Session(graph=g) as sess:
    x_array = np.arange(18).reshape(3, 2, 3)

    print('input shape: ', x_array.shape)
    print('Reshaped:\n',
          sess.run(x2, feed_dict={x:x_array}))
    print('Column Sums:\n',
          sess.run(xsum, feed_dict={x:x_array}))
    print('Column Means:\n',
          sess.run(xmean, feed_dict={x:x_array}))


The output shown after executing the preceding code is given here: 


input shape:  (3, 2, 3)
Reshaped:
 [[  0.   1.   2.   3.   4.   5.]
 [  6.   7.   8.   9.  10.  11.]
 [ 12.  13.  14.  15.  16.  17.]]
Column Sums:
 [ 18.  21.  24.  27.  30.  33.]
Column Means:
 [  6.   7.   8.   9.  10.  11.]


In this example, we worked with three functions: tf.reshape, tf.reduce_sum, and
tf.reduce_mean. Note that for reshaping, we used the value -1 for the first
dimension. This is because we do not know the value of the batch size; when
reshaping a tensor, if you use -1 for a specific dimension, the size of that dimension
will be computed according to the total size of the tensor and the remaining
dimensions. Therefore, tf.reshape(tensor, shape=(-1,)) can be used to flatten a tensor.
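NumPy's reshape method follows the same -1 convention, which makes it easy to try
out the behavior without building a graph. The following sketch uses the same
18-element array as the example above and is meant as an analogy, not as TensorFlow
code:

```python
import numpy as np

x_array = np.arange(18).reshape(3, 2, 3)
x2 = x_array.reshape(-1, 6)       # first dimension is inferred as 18 / 6 = 3
flat = x_array.reshape(-1)        # flattens to a rank-1 array of length 18

col_sums = x2.sum(axis=0)         # sum over the rows, one value per column
col_means = x2.mean(axis=0)       # mean over the rows, one value per column
print(x2.shape, flat.shape)
print(col_sums, col_means)
```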


Feel free to explore other TensorFlow functions from the official documentation at 
https://www.TensorFlow.org/api_docs/python/tf. 




Developing a simple model with the low-level 
TensorFlow API 


Now that we have familiarized ourselves with TensorFlow, let's take a look at a 
really practical example and implement Ordinary Least Squares (OLS) regression. 
For a quick refresher on regression analysis, refer to Chapter 10, Predicting 
Continuous Target Variables with Regression Analysis. 


Let's start by creating a small one-dimensional toy dataset with 10 training samples: 


>>> import tensorflow as tf
>>> import numpy as np
>>>
>>> X_train = np.arange(10).reshape((10, 1))
>>> y_train = np.array([1.0, 1.3, 3.1,
...                     2.0, 5.0, 6.3,
...                     6.6, 7.4, 8.0,
...                     9.0])


Given this dataset, we want to train a linear regression model to predict the output y 
from the input x. Let's implement this model in a class, which we name TfLinreg. 
For this, we would need two placeholders—one for the input x and one for y for 
feeding the data into our model. Next, we need to define the trainable variables— 
weights w and bias b. 


Then, we can define the linear regression model as $z = w \times x + b$, followed by
defining the cost function to be the Mean of Squared Error (MSE). To learn the
weight parameters of the model, we use the gradient descent optimizer. The code is
as follows:


class TfLinreg(object):

    def __init__(self, x_dim, learning_rate=0.01,
                 random_seed=None):
        self.x_dim = x_dim
        self.learning_rate = learning_rate
        self.g = tf.Graph()
        ## build the model
        with self.g.as_default():
            ## set graph-level random-seed
            tf.set_random_seed(random_seed)

            self.build()
            ## create initializer
            self.init_op = tf.global_variables_initializer()

    def build(self):
        ## define placeholders for inputs
        self.X = tf.placeholder(dtype=tf.float32,
                                shape=(None, self.x_dim),
                                name='x_input')
        self.y = tf.placeholder(dtype=tf.float32,
                                shape=(None),
                                name='y_input')
        print(self.X)
        print(self.y)
        ## define weight matrix and bias vector
        w = tf.Variable(tf.zeros(shape=(1)),
                        name='weight')
        b = tf.Variable(tf.zeros(shape=(1)),
                        name="bias")
        print(w)
        print(b)

        self.z_net = tf.squeeze(w*self.X + b,
                                name='z_net')
        print(self.z_net)

        sqr_errors = tf.square(self.y - self.z_net,
                               name='sqr_errors')
        print(sqr_errors)
        self.mean_cost = tf.reduce_mean(sqr_errors,
                                        name='mean_cost')

        optimizer = tf.train.GradientDescentOptimizer(
                    learning_rate=self.learning_rate,
                    name='GradientDescent')
        self.optimizer = optimizer.minimize(self.mean_cost)

So far, we have defined a class to construct our model. We will create an instance of
this class and call it lrmodel, as follows:


>>> lrmodel = TfLinreg(x_dim=X_train.shape[1], learning_rate=0.01)


The print statements that we wrote in the build method will display information
about six nodes in the graph (X, y, w, b, z_net, and sqr_errors) with their names
and shapes.


These print statements are optionally given for practice; however, inspecting the
shapes of variables can be very helpful in debugging complex models. The following
lines are printed when constructing the model:


Tensor("x_input:0", shape=(?, 1), dtype=float32)
Tensor("y_input:0", dtype=float32)
<tf.Variable 'weight:0' shape=(1,) dtype=float32_ref>
<tf.Variable 'bias:0' shape=(1,) dtype=float32_ref>
Tensor("z_net:0", dtype=float32)
Tensor("sqr_errors:0", dtype=float32)

The next step is to implement a training function to learn the weights of the linear 
regression model. Note that b is the bias unit (the y-axis intercept at x = 0). 


For training, we implement a separate function that needs a TensorFlow session, a
model instance, training data, and the number of epochs as input arguments. In this
function, first we initialize the variables in the TensorFlow session using the init_op
operation defined in the model. Then, we iterate and call the optimizer operation of
the model while feeding the training data. This function will return a list of training
costs as a side product:


def train_linreg(sess, model, X_train, y_train, num_epochs=10):
    ## initialize all variables: W and b
    sess.run(model.init_op)

    training_costs = []
    for i in range(num_epochs):
        _, cost = sess.run([model.optimizer, model.mean_cost],
                           feed_dict={model.X:X_train,
                                      model.y:y_train})
        training_costs.append(cost)

    return training_costs


So, now we can create a new TensorFlow session to launch the lrmodel.g graph and
pass all the required arguments to the train_linreg function for training:


>>> sess = tf.Session(graph=lrmodel.g)
>>> training_costs = train_linreg(sess, lrmodel, X_train, y_train)


Let's visualize the training costs after these 10 epochs to see whether the model has
converged or not:


>>> import matplotlib.pyplot as plt
>>> plt.plot(range(1, len(training_costs) + 1), training_costs)
>>> plt.tight_layout()
>>> plt.xlabel('Epoch')
>>> plt.ylabel('Training Cost')
>>> plt.show()


As we can see in the following plot, this simple model converges very quickly after a
few epochs:


[Figure: training cost plotted against the number of epochs]

So far so good. Looking at the cost function, it seems that we built a working
regression model from this particular dataset. Now, let's write a new function to
make predictions based on the input features. For this function, we need the
TensorFlow session, the model, and the test dataset:


def predict_linreg(sess, model, X_test):
    y_pred = sess.run(model.z_net,
                      feed_dict={model.X:X_test})
    return y_pred

Implementing a predict function was pretty straightforward; just running z_net
defined in the graph computes the predicted output values. Next, let's plot the linear
regression fit on the training data:


>>> plt.scatter(X_train, y_train,
...             marker='s', s=50,
...             label='Training Data')
>>> plt.plot(range(X_train.shape[0]),
...          predict_linreg(sess, lrmodel, X_train),
...          color='gray', marker='o',
...          markersize=6, linewidth=3,
...          label='LinReg Model')
>>> plt.xlabel('x')
>>> plt.ylabel('y')
>>> plt.legend()
>>> plt.tight_layout()
>>> plt.show()


As we can see in the resulting plot, our model fits the training data points
appropriately:


[Figure: training data points and the fitted LinReg Model line, y plotted against x]
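To make explicit what the gradient descent optimizer does for this model, the
following sketch performs the same kind of training in plain NumPy. It is an
illustration under stated assumptions and not a replacement for the TensorFlow
code: the function fit_linreg_gd is hypothetical, the y values are assumed to match
the toy dataset defined earlier, and the gradients of the mean squared error are
written out by hand:

```python
import numpy as np

X_train = np.arange(10, dtype=float).reshape((10, 1))
y_train = np.array([1.0, 1.3, 3.1, 2.0, 5.0,
                    6.3, 6.6, 7.4, 8.0, 9.0])

def fit_linreg_gd(X, y, learning_rate=0.01, num_epochs=10):
    """Fit z = w * x + b by full-batch gradient descent on the MSE."""
    w, b = 0.0, 0.0
    costs = []
    x = X[:, 0]
    for _ in range(num_epochs):
        errors = y - (w * x + b)
        costs.append(np.mean(errors ** 2))
        # Gradients of the mean squared error with respect to w and b.
        w -= learning_rate * (-2.0 * np.mean(errors * x))
        b -= learning_rate * (-2.0 * np.mean(errors))
    return w, b, costs

w, b, costs = fit_linreg_gd(X_train, y_train)
print(costs[0], costs[-1])
```

As in the TensorFlow version, the weight and bias start at zero and the cost is
recorded once per epoch, so the list of costs can be plotted in the same way.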
Training neural networks efficiently with high-level TensorFlow APIs


In this section, we will take a look at two high-level TensorFlow APIs: the Layers
API (tensorflow.layers or tf.layers) and the Keras API (tensorflow.contrib.keras).


Keras can be installed as a separate package. It supports Theano or TensorFlow as
backend (for more information, refer to the official website of Keras at
https://keras.io/).


However, after the release of TensorFlow 1.1.0, Keras has been added to the
TensorFlow contrib submodule. It is very likely that the Keras subpackage will be
moved outside the experimental contrib submodule and become one of the main
TensorFlow submodules soon.
Building multilayer neural networks using TensorFlow's Layers API


To see what neural network training via the tensorflow.layers (tf.layers) high-level
API looks like, let's implement a multilayer perceptron to classify the
handwritten digits from the MNIST dataset, which we introduced in the previous
chapter. The MNIST dataset can be downloaded from
http://yann.lecun.com/exdb/mnist/ in four parts, as listed here:


• Training set images: train-images-idx3-ubyte.gz (9.5 MB)
• Training set labels: train-labels-idx1-ubyte.gz (32 KB)
• Test set images: t10k-images-idx3-ubyte.gz (1.6 MB)
• Test set labels: t10k-labels-idx1-ubyte.gz (8.0 KB)


Note 


Note that TensorFlow also provides the same dataset as follows: 


import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data


However, we work with the MNIST dataset as an external dataset to learn all the 
steps of data preprocessing separately. This way, you would learn what you need to 
do with your own dataset. 


After downloading and unzipping the archives, we place the files in the mnist
directory in our current working directory so that we can load the training as well as
the test dataset, using the load_mnist(path, kind) function we implemented
previously in Chapter 12, Implementing a Multilayer Artificial Neural Network from
Scratch.


Then, the dataset will be loaded as follows: 


>>> import numpy as np
>>> ## loading the data
>>> X_train, y_train = load_mnist('./mnist/', kind='train')
>>> print('Rows: %d, Columns: %d' % (X_train.shape[0],
...                                  X_train.shape[1]))
Rows: 60000, Columns: 784
>>> X_test, y_test = load_mnist('./mnist/', kind='t10k')
>>> print('Rows: %d, Columns: %d' % (X_test.shape[0],
...                                  X_test.shape[1]))
Rows: 10000, Columns: 784
>>> ## mean centering and normalization:
>>> mean_vals = np.mean(X_train, axis=0)
>>> std_val = np.std(X_train)
>>>
>>> X_train_centered = (X_train - mean_vals)/std_val
>>> X_test_centered = (X_test - mean_vals)/std_val
>>>
>>> del X_train, X_test
>>>
>>> print(X_train_centered.shape, y_train.shape)
(60000, 784) (60000,)
>>> print(X_test_centered.shape, y_test.shape)
(10000, 784) (10000,)
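

The preprocessing step above can be sketched on toy numbers in plain Python. This is only an illustration under assumptions: the data is made up (not MNIST), and the helper name center_and_scale is hypothetical. It shows that each feature is centered by its training-set mean, that a single standard deviation over all training values is used for scaling, and that the test set reuses the training statistics.

```python
import statistics

def center_and_scale(rows, mean_vals, std_val):
    """Subtract per-column means and divide by one global standard deviation."""
    return [[(v - m) / std_val for v, m in zip(row, mean_vals)] for row in rows]

# toy "training" and "test" sets with two features each (made-up values)
X_train_toy = [[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]]
X_test_toy = [[2.0, 14.0]]

# per-feature means, analogous to np.mean(X, axis=0)
mean_vals = [statistics.fmean(col) for col in zip(*X_train_toy)]
# one global standard deviation over all entries, analogous to np.std(X)
all_values = [v for row in X_train_toy for v in row]
std_val = statistics.pstdev(all_values)

# the test set is transformed with the TRAINING statistics
X_train_centered_toy = center_and_scale(X_train_toy, mean_vals, std_val)
X_test_centered_toy = center_and_scale(X_test_toy, mean_vals, std_val)
```

Reusing the training mean and standard deviation for the test data keeps both sets on the same scale without letting test statistics leak into preprocessing.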


Now we can start building our model. We will start by creating two placeholders,
named tf_x and tf_y, and then build a multilayer perceptron as in Chapter 12,
Implementing a Multilayer Artificial Neural Network from Scratch, but with three
fully connected layers.


However, we will replace the logistic units in the hidden layer with hyperbolic
tangent activation functions (tanh), replace the logistic function in the output layer
with softmax, and add an additional hidden layer.


Note 


The tanh and softmax functions are new activation functions. We will learn more 
about these activation functions in the next section: Choosing activation functions 
for multilayer neural networks. 


import tensorflow as tf

n_features = X_train_centered.shape[1]
n_classes = 10
random_seed = 123
np.random.seed(random_seed)

g = tf.Graph()
with g.as_default():
    tf.set_random_seed(random_seed)
    tf_x = tf.placeholder(dtype=tf.float32,
                          shape=(None, n_features),
                          name='tf_x')

    tf_y = tf.placeholder(dtype=tf.int32,
                          shape=None, name='tf_y')
    y_onehot = tf.one_hot(indices=tf_y, depth=n_classes)

    h1 = tf.layers.dense(inputs=tf_x, units=50,
                         activation=tf.tanh,
                         name='layer1')

    h2 = tf.layers.dense(inputs=h1, units=50,
                         activation=tf.tanh,
                         name='layer2')

    logits = tf.layers.dense(inputs=h2,
                             units=10,
                             activation=None,
                             name='layer3')

    predictions = {
        'classes' : tf.argmax(logits, axis=1,
                              name='predicted_classes'),
        'probabilities' : tf.nn.softmax(logits,
                              name='softmax_tensor')
    }
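

To make the data flow through such a network concrete, here is a dependency-free sketch of the same forward pass in plain Python: two tanh hidden layers, a linear output layer producing logits, then argmax and softmax. The layer sizes are toy values, the weights are random, and the helper names dense and random_layer are hypothetical; this is not how TensorFlow implements its layers.

```python
import math
import random

def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: weights holds one row per output unit."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs

def softmax(logits):
    z_max = max(logits)  # shift by the maximum for numerical stability
    exps = [math.exp(z - z_max) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(123)
n_features, n_hidden, n_classes = 8, 5, 3   # toy sizes, not 784/50/10
w1, b1 = random_layer(n_features, n_hidden, rng)
w2, b2 = random_layer(n_hidden, n_hidden, rng)
w3, b3 = random_layer(n_hidden, n_classes, rng)

x = [rng.random() for _ in range(n_features)]
h1 = dense(x, w1, b1, activation=math.tanh)
h2 = dense(h1, w2, b2, activation=math.tanh)
logits = dense(h2, w3, b3)                   # no activation on the last layer
probabilities = softmax(logits)
predicted_class = logits.index(max(logits))  # argmax over the logits
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits yields the same class as taking the argmax of the probabilities.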


Next, we define the cost function and add an operator for initializing the model
variables as well as an optimization operator:


## define cost function and optimizer:
with g.as_default():
    cost = tf.losses.softmax_cross_entropy(
            onehot_labels=y_onehot, logits=logits)

    optimizer = tf.train.GradientDescentOptimizer(
            learning_rate=0.001)

    train_op = optimizer.minimize(
            loss=cost)

    init_op = tf.global_variables_initializer()


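As a rough illustration of what a softmax cross-entropy cost measures, the following plain-Python sketch computes it for a single sample directly from the logits. The function name mirrors the TensorFlow call only for readability; how TensorFlow aggregates the per-sample values over a mini-batch is not modeled here.

```python
import math

def softmax_cross_entropy(onehot_labels, logits):
    """Cross-entropy between a one-hot label and softmax(logits) for one sample."""
    max_logit = max(logits)
    # log-sum-exp trick: log(sum(exp(z))) computed without overflow
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return -sum(y * lp for y, lp in zip(onehot_labels, log_probs))

confident_and_right = softmax_cross_entropy([0, 0, 1], [0.1, 0.2, 5.0])
confident_and_wrong = softmax_cross_entropy([1, 0, 0], [0.1, 0.2, 5.0])
uniform_guess = softmax_cross_entropy([0, 1, 0], [0.0, 0.0, 0.0])
```

A confident correct prediction gives a cost near zero, a uniform guess over three classes gives log(3), and a confident wrong prediction is penalized most heavily.
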
Before we start training the network, we need a way to generate batches of data. For 
this, we implement the following function that returns a generator: 


def create_batch_generator(X, y, batch_size=128, shuffle=False):
    X_copy = np.array(X)
    y_copy = np.array(y)

    if shuffle:
        data = np.column_stack((X_copy, y_copy))
        np.random.shuffle(data)
        X_copy = data[:, :-1]
        y_copy = data[:, -1].astype(int)

    for i in range(0, X.shape[0], batch_size):
        yield (X_copy[i:i+batch_size, :], y_copy[i:i+batch_size])
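

The slicing pattern in the generator can be checked with simple arithmetic. The sketch below (the helper name batch_slices is hypothetical) walks the indices the same way range(0, n, batch_size) does and shows how many batches one epoch contains for 60,000 samples and a batch size of 64.

```python
def batch_slices(n_samples, batch_size):
    """Yield (start, stop) index pairs the way range(0, n, batch_size) walks the data."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

sizes = [stop - start for start, stop in batch_slices(60000, 64)]
n_batches = len(sizes)        # number of parameter updates per epoch: 938
last_batch_size = sizes[-1]   # the final slice holds only 32 samples
```

Since 60,000 is not a multiple of 64, the last batch of each epoch is smaller than the others; Python slicing past the end of an array handles this without an explicit check.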


Next, we can create a new TensorFlow session, initialize all the variables in our
network, and train it. We also display the average training loss after each epoch
to monitor the learning process:


>>> ## create a session to launch the graph
>>> sess = tf.Session(graph=g)
>>> ## run the variable initialization operator
>>> sess.run(init_op)
>>>
>>> ## 50 epochs of training:
>>> for epoch in range(50):
...     training_costs = []
...     batch_generator = create_batch_generator(
...             X_train_centered, y_train,
...             batch_size=64, shuffle=True)
...     for batch_X, batch_y in batch_generator:
...         ## prepare a dict to feed data to our network:
...         feed = {tf_x:batch_X, tf_y:batch_y}
...         _, batch_cost = sess.run([train_op, cost], feed_dict=feed)
...         training_costs.append(batch_cost)
...     print(' -- Epoch %2d  '
...           'Avg. Training Loss: %.4f' % (
...               epoch+1, np.mean(training_costs)
...     ))
 -- Epoch  1  Avg. Training Loss: 1.5573
 -- Epoch  2  Avg. Training Loss: 1.2532
 -- Epoch  3  Avg. Training Loss: 1.0854
 -- Epoch  4  Avg. Training Loss: 0.9738
[...]
 -- Epoch 49  Avg. Training Loss: 0.3527
 -- Epoch 50  Avg. Training Loss: 0.3498


The training process may take a couple of minutes. Finally, we can use the trained 
model to do predictions on the test dataset: 


>>> ## do prediction on the test set:
>>> feed = {tf_x : X_test_centered}
>>> y_pred = sess.run(predictions['classes'],
...                   feed_dict=feed)
>>>
>>> print('Test Accuracy: %.2f%%' % (
...       100*np.sum(y_pred == y_test)/y_test.shape[0]))
Test Accuracy: 93.89%


We can see that by leveraging high-level APIs, we can quickly build a model and 
test it. Therefore, a high-level API is very useful for prototyping our ideas and 
quickly checking the results. 


Next, we will develop a similar classification model for MNIST using Keras, which
is another high-level TensorFlow API.


Developing a multilayer neural network with
Keras


The development of Keras started in the early months of 2015. As of today, it has 
evolved into one of the most popular and widely used libraries that is built on top of 
Theano and TensorFlow. 


Similar to TensorFlow, Keras allows us to utilize our GPUs to accelerate neural
network training. One of its prominent features is that it has a very intuitive and
user-friendly API, which allows us to implement neural networks in only a few lines
of code.


Keras was first released as a standalone API that could leverage Theano as a
backend, and support for TensorFlow was added later. Keras has also been integrated
into TensorFlow since version 1.1.0. Therefore, if you have TensorFlow version
1.1.0 or later, no additional installation is needed for Keras. For more information
about Keras, visit the official website at http://keras.io.


Currently, Keras is part of the contrib module (which contains packages developed
by contributors to TensorFlow and is considered experimental code). In future
releases of TensorFlow, it may be moved to become a separate module in the
TensorFlow main API. For more information, visit the documentation on the
TensorFlow website at https://www.tensorflow.org/api_docs/python/tf/contrib/keras.


Note


Note that you may have to change the code from import
tensorflow.contrib.keras as keras to import tensorflow.keras as keras in
future versions of TensorFlow in the following code examples.


On the following pages, we will walk through the code examples for using Keras 
step by step. Using the same functions described in the previous section, we need to 
load the data as follows: 


>>> X_train, y_train = load_mnist('mnist/', kind='train')
>>> print('Rows: %d, Columns: %d' % (X_train.shape[0],
...                                  X_train.shape[1]))
Rows: 60000, Columns: 784
>>> X_test, y_test = load_mnist('mnist/', kind='t10k')
>>> print('Rows: %d, Columns: %d' % (X_test.shape[0],
...                                  X_test.shape[1]))
Rows: 10000, Columns: 784
>>>
>>> ## mean centering and normalization:
>>> mean_vals = np.mean(X_train, axis=0)
>>> std_val = np.std(X_train)
>>>
>>> X_train_centered = (X_train - mean_vals)/std_val
>>> X_test_centered = (X_test - mean_vals)/std_val
>>>
>>> del X_train, X_test
>>>
>>> print(X_train_centered.shape, y_train.shape)
(60000, 784) (60000,)
>>> print(X_test_centered.shape, y_test.shape)
(10000, 784) (10000,)


First, let's set the random seed for NumPy and TensorFlow so that we get consistent 
results: 


>>> import tensorflow as tf
>>> import tensorflow.contrib.keras as keras
>>>
>>> np.random.seed(123)
>>> tf.set_random_seed(123)


To continue with the preparation of the training data, we need to convert the class 
labels (integers 0-9) into the one-hot format. Fortunately, Keras provides a 
convenient tool for this: 


>>> y_train_onehot = keras.utils.to_categorical(y_train)
>>>
>>> print('First 3 labels: ', y_train[:3])
First 3 labels:  [5 0 4]
>>>
>>> print('\nFirst 3 labels (one-hot):\n', y_train_onehot[:3])
First 3 labels (one-hot):
[[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]]
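

For readers who want to see what the conversion does, here is a minimal plain-Python sketch of one-hot encoding. The function name to_onehot is hypothetical and is not the Keras implementation; it only reproduces the idea of placing a 1 at the index of the class label.

```python
def to_onehot(labels, n_classes=None):
    """Return one row per label with a 1.0 at the label's index."""
    if n_classes is None:
        n_classes = max(labels) + 1
    return [[1.0 if j == label else 0.0 for j in range(n_classes)]
            for label in labels]

# the first three MNIST training labels shown above are 5, 0, and 4
first_three = to_onehot([5, 0, 4], n_classes=10)
```

Passing n_classes explicitly avoids a too-narrow encoding when a small batch happens not to contain the highest class label.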


Now, we can get to the interesting part and implement a neural network. Briefly, we
will have three layers, where the first two layers each have 50 hidden units with the
tanh activation function and the last layer has 10 units for the 10 class labels and
uses softmax to give the probability of each class. Keras makes these tasks very
simple, as you can see in the following code implementation:


model = keras.models.Sequential()

model.add(
    keras.layers.Dense(
        units=50,
        input_dim=X_train_centered.shape[1],
        kernel_initializer='glorot_uniform',
        bias_initializer='zeros',
        activation='tanh'))

model.add(
    keras.layers.Dense(
        units=50,
        input_dim=50,
        kernel_initializer='glorot_uniform',
        bias_initializer='zeros',
        activation='tanh'))

model.add(
    keras.layers.Dense(
        units=y_train_onehot.shape[1],
        input_dim=50,
        kernel_initializer='glorot_uniform',
        bias_initializer='zeros',
        activation='softmax'))

sgd_optimizer = keras.optimizers.SGD(
        lr=0.001, decay=1e-7, momentum=.9)

model.compile(optimizer=sgd_optimizer,
              loss='categorical_crossentropy')


First, we initialize a new model using the Sequential class to implement a
feedforward neural network. Then, we can add as many layers to it as we like.
However, since the first layer that we add is the input layer, we have to make sure
that the input_dim attribute matches the number of features (columns) in the training
set (784 features or pixels in the neural network implementation).


Also, we have to make sure that the number of output units (units) and input units
(input_dim) of two consecutive layers match. In the preceding example, we added
two hidden layers with 50 hidden units plus one bias unit each. The number of units
in the output layer should be equal to the number of unique class labels, that is, the
number of columns in the one-hot-encoded class label array.
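

The matching of units and input_dim between consecutive layers also determines how many trainable parameters the network has. The following arithmetic sketch (the helper name dense_param_count is hypothetical) counts weights and biases for the 784-50-50-10 architecture described above.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [784, 50, 50, 10]   # input features, two hidden layers, output classes
per_layer = [dense_param_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)     # 39250 + 2550 + 510 = 42310
```

Most of the parameters sit in the first layer, because it connects all 784 input pixels to the 50 hidden units.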


Note


Note that we used a new initialization algorithm for weight matrices by setting
kernel_initializer='glorot_uniform'. Glorot initialization (also known as
Xavier initialization) is a more robust way of initialization for deep neural networks
(Understanding the difficulty of training deep feedforward neural networks, Xavier
Glorot and Yoshua Bengio, in Artificial Intelligence and Statistics, volume 9, pages:
249-256. 2010). The biases are initialized to zero, which is more common, and in
fact the default setting in Keras. We will discuss this weight initialization scheme in
more detail in Chapter 14, Going Deeper - The Mechanics of TensorFlow.
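

As a hedged preview of that scheme, the sketch below samples weights from a uniform distribution on [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)), which is the Glorot uniform rule from the cited paper. The function here is a plain-Python stand-in written for illustration, not the Keras initializer itself.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, limit

rng = random.Random(123)
# first hidden layer of the MNIST network: 784 inputs, 50 units
weights, limit = glorot_uniform(fan_in=784, fan_out=50, rng=rng)
```

For this layer the limit is roughly 0.085, so wider layers start with smaller weights, which helps keep the scale of the activations comparable from layer to layer.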


Before we can compile our model, we also have to define an optimizer. In the
preceding example, we chose stochastic gradient descent optimization, which we
are already familiar with from previous chapters. Furthermore, we can set values for
the decay constant and momentum to adjust the learning rate as training progresses,
as discussed in Chapter 12, Implementing a Multilayer Artificial Neural
Network from Scratch. Lastly, we set the cost (or loss) function to
categorical_crossentropy.


The binary cross-entropy is just a technical term for the cost function in logistic
regression, and the categorical cross-entropy is its generalization for multiclass
predictions via softmax, which we will cover in the section Estimating class
probabilities in multiclass classification via the softmax function later in this chapter.
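

The relationship between the two cost functions can be verified numerically. In this small sketch (both function names are illustrative, and the probability 0.888 is just an example value), a two-class categorical cross-entropy gives exactly the same cost as the binary cross-entropy.

```python
import math

def binary_cross_entropy(y, p):
    """Logistic-regression cost for one sample with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_cross_entropy(onehot, probas):
    """Cost for one sample given one-hot labels and class probabilities."""
    return -sum(t * math.log(p) for t, p in zip(onehot, probas))

p_positive = 0.888  # predicted probability of the positive class
binary = binary_cross_entropy(1, p_positive)
# same sample expressed as a two-class problem: [P(class 0), P(class 1)]
categorical = categorical_cross_entropy([0, 1], [1 - p_positive, p_positive])
```

With more than two classes, the same sum simply runs over more terms, of which only the one for the true class is nonzero.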


After compiling the model, we can now train it by calling the fit method. Here, we
are using mini-batch stochastic gradient descent with a batch size of 64 training
samples per batch. We train the MLP over 50 epochs, and we can follow the
optimization of the cost function during training by setting verbose=1.


The validation_split parameter is especially handy since it will reserve 10 percent
of the training data (here, 6,000 samples) for validation after each epoch so that we
can monitor whether the model is overfitting during training:


>>> history = model.fit(X_train_centered, y_train_onehot,
...                     batch_size=64, epochs=50,
...                     verbose=1,
...                     validation_split=0.1)
Train on 54000 samples, validate on 6000 samples
Epoch 1/50
54000/54000 [==============================] - 3s - loss: 0.7247 - val_loss: 0.3616
Epoch 2/50
54000/54000 [==============================] - 3s - loss: 0.3718 - val_loss: 0.2815
Epoch 3/50
54000/54000 [==============================] - 3s - loss: 0.3087 - val_loss: 0.2447
[...]
Epoch 50/50
54000/54000 [==============================] - 3s - loss: 0.0485 - val_loss: 0.1174
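

The split of 60,000 samples into 54,000 for training and 6,000 for validation can be sketched as a simple slice. According to the Keras documentation, the validation samples are taken from the end of the arrays passed to fit, before any shuffling; the function below is a hypothetical stand-in that mimics this behavior rather than the Keras source code.

```python
def split_for_validation(samples, validation_split):
    """Hold out the trailing fraction of the samples for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

indices = list(range(60000))  # stand-in for the 60,000 training samples
train_part, val_part = split_for_validation(indices, validation_split=0.1)
```

Because the held-out part is a fixed tail of the data, it is worth making sure that the training array is not sorted by class label before relying on validation_split.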


Printing the value of the cost function is extremely useful during training. This is
because we can quickly spot whether the cost is decreasing and, if it is not, stop
the algorithm early to tune the hyperparameter values.
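

One simple way to automate this check is sketched below. The function name is_improving, the patience parameter, and the loss values are all illustrative assumptions; the idea is only to compare the best recent loss with the best loss seen before.

```python
def is_improving(losses, patience=3, min_delta=0.0):
    """True if the best of the last `patience` losses beats the best earlier loss."""
    if len(losses) <= patience:
        return True  # too early to judge
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_recent < best_before - min_delta

# made-up loss curves, one value per epoch
decreasing = [0.90, 0.60, 0.50, 0.45, 0.42]
stalled = [0.90, 0.60, 0.61, 0.62, 0.63]
```

Here is_improving(decreasing) returns True, while is_improving(stalled) returns False, which would be a signal to stop and revisit the learning rate or other hyperparameters.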


To predict the class labels, we can then use the predict_classes method to return
the class labels directly as integers:


>>> y_train_pred = model.predict_classes(X_train_centered, verbose=0)
>>> print('First 3 predictions: ', y_train_pred[:3])
First 3 predictions:  [5 0 4]


Finally, let's print the model accuracy on training and test sets:


>>> y_train_pred = model.predict_classes(X_train_centered,
...                                      verbose=0)
>>> correct_preds = np.sum(y_train == y_train_pred, axis=0)
>>> train_acc = correct_preds / y_train.shape[0]
>>>
>>> print('First 3 predictions: ', y_train_pred[:3])
First 3 predictions:  [5 0 4]
>>>
>>> print('Training accuracy: %.2f%%' % (train_acc * 100))
Training accuracy: 98.88%
>>>
>>> y_test_pred = model.predict_classes(X_test_centered,
...                                     verbose=0)
>>> correct_preds = np.sum(y_test == y_test_pred, axis=0)
>>> test_acc = correct_preds / y_test.shape[0]
>>> print('Test accuracy: %.2f%%' % (test_acc * 100))
Test accuracy: 96.04%
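

The accuracy computation itself is just the fraction of matching labels. A minimal plain-Python version, using made-up labels and a hypothetical helper named accuracy, looks as follows.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true_toy = [5, 0, 4, 1, 9]   # made-up true labels
y_pred_toy = [5, 0, 4, 7, 9]   # made-up predictions with one mistake
toy_acc = accuracy(y_true_toy, y_pred_toy)
toy_report = 'Toy accuracy: %.2f%%' % (toy_acc * 100)
```

A large gap between training and test accuracy, as in the output above, is a typical sign of some overfitting.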


Note that this is just a very simple neural network without optimized tuning
parameters. If you are interested in playing more with Keras, feel free to further
tweak the learning rate, momentum, decay, and number of hidden units.


Choosing activation functions for
multilayer networks


For simplicity, we have only discussed the sigmoid activation function in the context
of multilayer feedforward neural networks so far; we used it in the hidden layer as
well as the output layer in the multilayer perceptron implementation in Chapter 12,
Implementing a Multilayer Artificial Neural Network from Scratch.


Although we referred to this activation function as a sigmoid function, as it is
commonly called in the literature, the more precise term would be the logistic
function. In the following subsections, you will learn more about alternative
sigmoidal functions that are useful for implementing multilayer neural networks.


Technically, we can use any function as an activation function in multilayer neural
networks as long as it is differentiable. We can even use linear activation functions,
such as in Adaline (Chapter 2, Training Simple Machine Learning Algorithms for
Classification). However, in practice, it would not be very useful to use linear
activation functions for both hidden and output layers since we want to introduce
nonlinearity in a typical artificial neural network to be able to tackle complex
problems. The sum or composition of linear functions yields a linear function after all.
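

A one-dimensional sketch makes this point concrete. With made-up weights, two stacked linear "layers" compute exactly the same function as a single linear layer whose weight and bias are derived from them.

```python
def linear_layer(x, w, b):
    return w * x + b

w1, b1 = 0.5, 1.0    # first "layer" (illustrative values)
w2, b2 = -2.0, 0.25  # second "layer" (illustrative values)

def stacked(x):
    """Two linear layers applied one after the other."""
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# the equivalent single layer: w = w2 * w1, b = w2 * b1 + b2
w_eq, b_eq = w2 * w1, w2 * b1 + b2

def collapsed(x):
    return linear_layer(x, w_eq, b_eq)

inputs = [-3.0, 0.0, 1.5, 10.0]
```

No matter how many such layers are stacked, the result stays a single linear map, which is why a nonlinear activation is needed between layers.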


The logistic activation function that we used in Chapter 12, Implementing a
Multilayer Artificial Neural Network from Scratch, probably mimics the concept of a
neuron in a brain most closely; we can think of it as the probability of whether a
neuron fires or not.


However, logistic activation functions can be problematic if we have highly negative
input since the output of the sigmoid function would be close to zero in this case. If
the sigmoid function returns outputs that are close to zero, the neural network would
learn very slowly and it becomes more likely that it gets trapped in local minima
during training. This is why people often prefer a hyperbolic tangent as an activation
function in hidden layers.


Before we discuss what a hyperbolic tangent looks like, let's briefly recapitulate
some of the basics of the logistic function and look at a generalization that makes it
more useful for multiclass classification problems.


Logistic function recap


As we mentioned in the introduction to this section, the logistic function, often just 
called the sigmoid function, is in fact a special case of a sigmoid function. Recall 
from the section on logistic regression in Chapter 3, A Tour of Machine Learning 
Classifiers Using scikit-learn, that we can use a logistic function to model the 
probability that sample x belongs to the positive class (class 1) in a binary 
classification task. The given net input z is shown in the following equation: 


$$z = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m = \sum_{i=0}^{m} w_i x_i = \mathbf{w}^T \mathbf{x}$$


The logistic function will compute the following:


$$\phi_{logistic}(z) = \frac{1}{1 + e^{-z}}$$


Note that $w_0$ is the bias unit (y-axis intercept, which means $x_0 = 1$). To provide a
more concrete example, let's assume a model for a two-dimensional data point x and
a model with the following weight coefficients assigned to the w vector:


>>> import numpy as np

>>> X = np.array([1, 1.4, 2.5]) ## first value must be 1
>>> w = np.array([0.4, 0.3, 0.5])

>>> def net_input(X, w):
...     return np.dot(X, w)
...
>>> def logistic(z):
...     return 1.0 / (1.0 + np.exp(-z))
...
>>> def logistic_activation(X, w):
...     z = net_input(X, w)
...     return logistic(z)
...
>>> print('P(y=1|x) = %.3f' % logistic_activation(X, w))
P(y=1|x) = 0.888


If we calculate the net input and use it to activate a logistic neuron with those 
particular feature values and weight coefficients, we get a value of 0.888, which we 
can interpret as 88.8 percent probability that this particular sample x belongs to the 
positive class. 


In Chapter 12, Implementing a Multilayer Artificial Neural Network from Scratch, 
we used the one-hot-encoding technique to compute the values in the output layer 
consisting of multiple logistic activation units. However, as we will demonstrate 
with the following code example, an output layer consisting of multiple logistic 
activation units does not produce meaningful, interpretable probability values: 


>>> # W : array with shape = (n_output_units, n_hidden_units+1)
... #     note that the first column are the bias units
>>> W = np.array([[1.1, 1.2, 0.8, 0.4],
...               [0.2, 0.4, 1.0, 0.2],
...               [0.6, 1.5, 1.2, 0.7]])
>>>
>>> # A : data array with shape = (n_hidden_units + 1, n_samples)
... #     note that the first column of this array must be 1
>>> A = np.array([[1, 0.1, 0.4, 0.6]])
>>>
>>> Z = np.dot(W, A[0])
>>> y_probas = logistic(Z)
>>> print('Net Input: \n', Z)
Net Input:
[ 1.78  0.76  1.65]
>>> print('Output Units:\n', y_probas)
Output Units:
[ 0.85569687  0.68135373  0.83889105]


As we can see in the output, the resulting values cannot be interpreted as
probabilities for a three-class problem. The reason for this is that they do not sum up
to 1. However, this is in fact not a big concern if we only use our model to predict
the class labels, not the class membership probabilities. One way to predict the class
label from the output units obtained earlier is to use the maximum value:


>>> y_class = np.argmax(Z, axis=0)
>>> print('Predicted class label: %d' % y_class)
Predicted class label: 0


In certain contexts, it can be useful to compute meaningful class probabilities for
multiclass predictions. In the next section, we will take a look at a generalization of
the logistic function, the softmax function, which can help us with this task.




Estimating class probabilities in multiclass 
classification via the softmax function 


In the previous section, we saw how we could obtain a class label using the argmax
function. The softmax function is in fact a soft form of the argmax function; instead
of giving a single class index, it provides the probability of each class. Therefore, it
allows us to compute meaningful class probabilities in multiclass settings
(multinomial logistic regression).


In softmax, the probability of a particular sample with net input z belonging to the
ith class can be computed with a normalization term in the denominator, that is, the
sum of the exponentially weighted linear functions over all M classes:


$$P(y = i \mid z) = \phi(z) = \frac{e^{z_i}}{\sum_{j=1}^{M} e^{z_j}}$$

To see softmax in action, let's code it up in Python: 


>>> def softmax(z):
...     return np.exp(z) / np.sum(np.exp(z))
...
>>> y_probas = softmax(Z)
>>> print('Probabilities:\n', y_probas)
Probabilities:
[ 0.44668973  0.16107406  0.39223621]
>>> np.sum(y_probas)
1.0


As we can see, the predicted class probabilities now sum up to 1, as we would
expect. It is also notable that the predicted class label is the same as when we applied
the argmax function to the logistic output. Intuitively, it may help to think of the
softmax function as a normalized output that is useful to obtain meaningful
class-membership predictions in multiclass settings.
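

One practical caveat is that exponentiating large net inputs can overflow. A common remedy, sketched here in plain Python as an addition to the text rather than something taken from it, is to subtract the maximum net input before exponentiating; this leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(z):
    """Softmax with the maximum subtracted first to avoid overflow in exp."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probas = softmax([1.78, 0.76, 1.65])            # the three net inputs of this section's example
shifted = softmax([1001.78, 1000.76, 1001.65])  # same inputs plus 1000
```

Both calls give the same probabilities (about 0.447, 0.161, and 0.392), whereas a naive implementation would fail on the second input because exp(1001.78) exceeds the floating-point range.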


Broadening the output spectrum using a
hyperbolic tangent


Another sigmoid function that is often used in the hidden layers of artificial neural 
networks is the hyperbolic tangent (commonly known as tanh), which can be 
interpreted as a rescaled version of the logistic function: 


$$\phi_{logistic}(z) = \frac{1}{1 + e^{-z}}$$


$$\phi_{tanh}(z) = 2 \times \phi_{logistic}(2z) - 1 = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
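

The rescaling relationship between the two functions can be checked numerically. The short sketch below, using only Python's math module and a handful of arbitrarily chosen inputs, compares the rescaled logistic function with the built-in hyperbolic tangent.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_from_logistic(z):
    """tanh written as a rescaled and shifted logistic function."""
    return 2.0 * logistic(2.0 * z) - 1.0

z_values = [-5.0, -1.0, 0.0, 0.5, 3.0]
max_abs_diff = max(abs(tanh_from_logistic(z) - math.tanh(z)) for z in z_values)
```

The largest difference is on the order of floating-point rounding error, which confirms that the two expressions describe the same function.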


The advantage of the hyperbolic tangent over the logistic function is that it has a
broader output spectrum and ranges in the open interval (-1, 1), which can improve
the convergence of the backpropagation algorithm (Neural Networks for Pattern
Recognition, C. M. Bishop, Oxford University Press, pages: 500-501, 1995).


In contrast, the logistic function returns an output signal that ranges in the open 
interval (0, 1). For an intuitive comparison of the logistic function and the hyperbolic 
tangent, let's plot the two sigmoid functions: 


>>> import matplotlib.pyplot as plt

>>> def tanh(z):
...     e_p = np.exp(z)
...     e_m = np.exp(-z)
...     return (e_p - e_m) / (e_p + e_m)

>>> z = np.arange(-5, 5, 0.005)
>>> log_act = logistic(z)
>>> tanh_act = tanh(z)
>>> plt.ylim([-1.5, 1.5])
>>> plt.xlabel('net input $z$')
>>> plt.ylabel(r'activation $\phi(z)$')
>>> plt.axhline(1, color='black', linestyle=':')
>>> plt.axhline(0.5, color='black', linestyle=':')
>>> plt.axhline(0, color='black', linestyle=':')
>>> plt.axhline(-0.5, color='black', linestyle=':')
>>> plt.axhline(-1, color='black', linestyle=':')
>>> plt.plot(z, tanh_act,
...          linewidth=3, linestyle='--',
...          label='tanh')
>>> plt.plot(z, log_act,
...          linewidth=3,
...          label='logistic')
>>> plt.legend(loc='lower right')
>>> plt.tight_layout()
>>> plt.show()


As we can see, the shapes of the two sigmoidal curves look very similar; however,
the tanh function has a 2x larger output space than the logistic function:


[Figure: tanh (dashed) and logistic (solid) activation plotted against the net input z]





Note that we implemented the logistic and tanh functions verbosely for the
purpose of illustration. In practice, we can use NumPy's tanh function to achieve the
same results:


>>> tanh_act = np.tanh(z)


In addition, the logistic function is available in SciPy's special module:


>>> from scipy.special import expit
>>> log_act = expit(z)


Rectified linear unit activation


Rectified Linear Unit (ReLU) is another activation function that is often used in 
deep neural networks. Before we understand ReLU, we should step back and 
understand the vanishing gradient problem of tanh and logistic activations. 


To understand this problem, let's assume that we initially have the net input
$z_1 = 20$, which changes to $z_2 = 25$. Computing the tanh activation, we get
$\phi(z_1) \approx 1.0$ and $\phi(z_2) \approx 1.0$, which shows no change in the output.


This means the derivative of the activations with respect to the net input diminishes
as z becomes large. As a result, learning the weights during the training phase becomes
very slow because the gradient terms may be very close to zero. ReLU activation
addresses this issue. Mathematically, ReLU is defined as follows:


$$\phi(z) = \max(0, z)$$


ReLU is still a nonlinear function that is good for learning complex functions with
neural networks. Besides this, the derivative of ReLU, with respect to its input, is
always 1 for positive input values. Therefore, it avoids the problem of vanishing
gradients for positive inputs, making it suitable for deep neural networks. We will
use the ReLU activation function in the next chapter as an activation function for
multilayer convolutional neural networks.
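

The difference in gradient behavior can be illustrated with a few numbers. The following sketch evaluates the derivatives of tanh, the logistic function, and ReLU at several net inputs; the derivative formulas are standard, while the value used for ReLU at exactly z = 0 is a convention assumed here.

```python
import math

def tanh_derivative(z):
    """d/dz tanh(z) = 1 - tanh(z)**2."""
    return 1.0 - math.tanh(z) ** 2

def logistic_derivative(z):
    """d/dz logistic(z) = logistic(z) * (1 - logistic(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_derivative(z):
    """1 for positive inputs, 0 for negative inputs (0 is used at z == 0 here)."""
    return 1.0 if z > 0 else 0.0

# (tanh', logistic', ReLU') at a small and at several large net inputs
gradients = {z: (tanh_derivative(z), logistic_derivative(z), relu_derivative(z))
             for z in (0.5, 5.0, 20.0, 25.0)}
```

At z = 20 and z = 25 the tanh and logistic derivatives are numerically indistinguishable from zero, whereas the ReLU derivative is still 1, so the error signal passes through the unit unchanged.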


Now that we know more about the different activation functions that are commonly
used in artificial neural networks, let's conclude this section with an overview of the
different activation functions that we encountered in this book:


Activation function       | Equation                                                                           | Example
Linear                    | $\phi(z) = z$                                                                      | Adaline, linear regression
Unit step (Heaviside)     | $\phi(z) = 0$ if $z < 0$; $0.5$ if $z = 0$; $1$ if $z > 0$                         | Perceptron variant
Sign (signum)             | $\phi(z) = -1$ if $z < 0$; $0$ if $z = 0$; $1$ if $z > 0$                          | Perceptron variant
Piece-wise linear         | $\phi(z) = 0$ if $z \le -1/2$; $z + 1/2$ if $-1/2 < z < 1/2$; $1$ if $z \ge 1/2$   | Support vector machine
Logistic (sigmoid)        | $\phi(z) = \frac{1}{1 + e^{-z}}$                                                   | Logistic regression, multilayer NN
Hyperbolic tangent (tanh) | $\phi(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$                                  | Multilayer NN
ReLU                      | $\phi(z) = \max(0, z)$                                                             | Multilayer NN
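

For quick experimentation, these activation functions can be written as small Python functions. This is a sketch under assumptions: the values returned exactly at z = 0 for the step and sign functions, and the breakpoints of the piece-wise linear function, follow one common convention and may differ between references.

```python
import math

def linear(z):
    return z

def unit_step(z):
    return 0.0 if z < 0 else (0.5 if z == 0 else 1.0)

def sign(z):
    return -1.0 if z < 0 else (0.0 if z == 0 else 1.0)

def piecewise_linear(z):
    if z <= -0.5:
        return 0.0
    if z >= 0.5:
        return 1.0
    return z + 0.5

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

activations = {'linear': linear, 'unit step': unit_step, 'sign': sign,
               'piece-wise linear': piecewise_linear, 'logistic': logistic,
               'tanh': math.tanh, 'ReLU': relu}
outputs_at_zero = {name: f(0.0) for name, f in activations.items()}
```

Evaluating every function at the same inputs, as in outputs_at_zero, is a simple way to compare their output ranges side by side.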





Summary 


In this chapter, you learned how to use TensorFlow, an open source library for 
numerical computations with a special focus on deep learning. While TensorFlow is 
more inconvenient to use compared to NumPy, due to its additional complexity to 
support GPUs, it allows us to define and train large, multilayer neural networks very 
efficiently. 


Also, you learned about the TensorFlow API to build complex machine learning and 
neural network models and run them efficiently. First, we explored programming in 
the low-level TensorFlow API. Implementing models at this level may be tedious 
when we have to program at the level of matrix-vector multiplications and define 
every detail of each operation. However, the advantage is that this allows us as 
developers to combine such basic operations and build more complex models. 
Furthermore, we discussed how TensorFlow allows us to utilize the GPUs for 
training and testing big neural networks to speed up the computations. Without the 
use of GPUs, training some networks would typically need months of computation! 


We then explored two high-level APIs that make building neural network models a 
lot easier compared to the low-level API. Specifically, we used TensorFlow Layers 
and Keras to build the multilayer neural network and learned how to build models 
using those APIs. 


Finally, you learned about different activation functions and understood their
behaviors and applications. Specifically, in this chapter, we saw tanh, softmax, and
ReLU. In Chapter 12, Implementing a Multilayer Artificial Neural Network from
Scratch, we started with implementing a simple Multilayer Perceptron (MLP) to
classify a handwritten image in the MNIST dataset. While the low-level
implementation from scratch was helpful to illustrate the core concepts of a
multilayer neural network, such as the forward pass and backpropagation, training
neural networks using NumPy is very inefficient and impractical for large networks.


In the next chapter, we'll continue our journey and dive deeper into TensorFlow, and 
we'll find ourselves working with graph and session objects. Along the way, we'll 
learn many new concepts, such as placeholders, variables, and saving and restoring 
models in TensorFlow. 




Chapter 14. Going Deeper — The 
Mechanics of TensorFlow 


In Chapter 13, Parallelizing Neural Network Training with TensorFlow, we trained a
multilayer perceptron to classify MNIST digits, using various aspects of the
TensorFlow Python API. That was a great way to dive straight into some hands-on
experience with TensorFlow neural network training and machine learning.


In this chapter, we'll now shift our focus squarely onto TensorFlow itself, and
explore in detail the impressive mechanics and features that TensorFlow offers:


• Key features and advantages of TensorFlow
• TensorFlow ranks and tensors
• Understanding and working with TensorFlow graphs
• Working with TensorFlow variables
• TensorFlow operations with different scopes
• Common tensor transformations: working with ranks, shapes, and types
• Transforming tensors as multidimensional arrays
• Saving and restoring a model in TensorFlow
• Visualizing neural network graphs with TensorBoard


We'll stay hands-on in this chapter, of course, and implement graphs throughout the
chapter to explore the main TensorFlow features and concepts. Along the way, we'll
also revisit a regression model, explore neural network graph visualization with
TensorBoard, and suggest some ways that you could explore visualizing more of the
graphs that you'll make through this chapter.


Key features of TensorFlow


TensorFlow gives us a scalable, multiplatform programming interface for 
implementing and running machine learning algorithms. The TensorFlow API has 
been relatively stable and mature since its 1.0 release in 2017. There are other deep 
learning libraries available, but they are still very experimental by comparison. 


A key feature of TensorFlow that we already noted in Chapter 13, Parallelizing
Neural Network Training with TensorFlow, is its ability to work with single or
multiple GPUs. This allows users to train machine learning models very efficiently
on large-scale systems.


TensorFlow has strong growth drivers. Its development is funded and supported by
Google, and so a large team of software engineers works on improvements
continuously. TensorFlow also has strong support from open source developers, who
avidly contribute and provide user feedback. This has made the TensorFlow library
more useful to both academic researchers and developers in industry. A further
consequence of these factors is that TensorFlow has extensive documentation and
tutorials to help new users.


Last but not least among these key features, TensorFlow supports mobile
deployment, which makes it a very suitable tool for production.


TensorFlow ranks and tensors


The TensorFlow library lets users define operations and functions over tensors as 
computational graphs. Tensors are a generalizable mathematical notation for 
multidimensional arrays holding data values, where the dimensionality of a tensor is 
typically referred to as its rank. 


We've worked mostly, so far, with tensors of rank zero to two. For instance, a scalar, 
a single number such as an integer or float, is a tensor of rank 0. A vector is a tensor 
of rank 1, and a matrix is a tensor of rank 2. But it doesn't stop here. The tensor 
notation can be generalized to higher dimensions, as we'll see in the next chapter 
when we work with an input of rank 3 and weight tensors of rank 4 to support 
images with multiple color channels. 
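
As a quick check of these definitions, the following sketch uses NumPy arrays as stand-ins for tensors. It assumes that NumPy's ndim attribute plays the role of the rank, and the 28 x 28 x 3 image size is only an illustrative choice:

```python
import numpy as np

# NumPy arrays as stand-ins for tensors; ndim corresponds to the rank.
scalar = np.array(3.14)                # rank 0
vector = np.array([1, 2, 3, 4])        # rank 1
matrix = np.array([[1, 2], [3, 4]])    # rank 2
image = np.zeros((28, 28, 3))          # rank 3: height x width x color channels

arrays = (scalar, vector, matrix, image)
ranks = [arr.ndim for arr in arrays]
shapes = [arr.shape for arr in arrays]
print('Ranks: ', ranks)
print('Shapes:', shapes)
```

The number of entries in each shape tuple equals the rank, which is why a scalar has the empty shape ().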


To make the concept of a tensor more intuitive, consider the following figure, which 
represents tensors of ranks 0 and 1 in the first row, and tensors of ranks 2 and 3 in 
the second row: 


[Figure: tensors of rank 0 (scalar) and rank 1 (vector) in the first row, and tensors of rank 2 (matrix) and rank 3 in the second row] 
How to get the rank and shape of a tensor 


We can use the tf.rank function to get the rank of a tensor. It is important to note 
that tf.rank will return a tensor as output, and in order to get the actual value, we 
will need to evaluate that tensor. 


In addition to the tensor rank, we can also get the shape of a TensorFlow tensor 
(similar to the shape of a NumPy array). For example, if x is a tensor, we can get its 
shape using x.get_shape(), which will return an object of a special class called 
TensorShape. 


We can print the shape and use it directly for the shape argument when creating 
other tensors. However, we cannot index or slice this object directly. If we want to 
index or slice different elements of this object, then we can convert it into a Python 
list, using the as_list method of the TensorShape class. 


See the following examples on how to use the tf.rank function and the get_shape 
method of a tensor. The following code example illustrates how to retrieve the rank 
and shape of the tensor objects in a TensorFlow session: 


>>> import tensorflow as tf
>>> import numpy as np
>>>
>>> g = tf.Graph()
>>>
>>> ## define the computation graph
>>> with g.as_default():
...     ## define tensors t1, t2, t3
...     t1 = tf.constant(np.pi)
...     t2 = tf.constant([1, 2, 3, 4])
...     t3 = tf.constant([[1, 2], [3, 4]])
...
...     ## get their ranks
...     r1 = tf.rank(t1)
...     r2 = tf.rank(t2)
...     r3 = tf.rank(t3)
...
...     ## get their shapes
...     s1 = t1.get_shape()
...     s2 = t2.get_shape()
...     s3 = t3.get_shape()
...     print('Shapes:', s1, s2, s3)
Shapes: () (4,) (2, 2)
>>> with tf.Session(graph=g) as sess:
...     print('Ranks:',
...           r1.eval(),
...           r2.eval(),
...           r3.eval())
Ranks: 0 1 2


As we can see, the rank of the t1 tensor is 0 since it is just a scalar (corresponding to 
the () shape). The rank of the t2 vector is 1, and since it has four elements, its shape 
is the one-element tuple (4,). Lastly, the rank of the 2 x 2 matrix t3 is 2; thus, its 
corresponding shape is given by the (2, 2) tuple. 
Understanding TensorFlow's computation graphs 


TensorFlow relies on building a computation graph at its core, and it uses this 
computation graph to derive relationships between tensors from the input all the way 
to the output. Let's say we have rank 0 (scalar) tensors a, b, and c, and we want 
to evaluate $z = 2 \times (a - b) + c$. This evaluation can be represented as a 
computation graph, as shown in the following figure: 


[Figure: computation graph implementing the equation $z = 2 \times (a - b) + c$, where a, b, and c are the input tensors (scalars), $r_1$ and $r_2$ are the intermediate result tensors, and z is the tensor of the final result] 





As we can see, the computation graph is simply a network of nodes. Each node 
represents an operation, which applies a function to its input tensor or tensors and 
returns zero or more tensors as the output. 
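
To make the idea of a network of nodes concrete, here is a minimal pure-Python sketch of such a graph for $z = 2 \times (a - b) + c$. The Node class and the constant helper are hypothetical names introduced only for this illustration; this is a toy model of the concept and not how TensorFlow is implemented internally:

```python
class Node:
    """A toy graph node: applies a function to the values of its input nodes."""

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = tuple(inputs)

    def evaluate(self, executed=None):
        # All preceding nodes are evaluated first, then this node's function.
        args = [node.evaluate(executed) for node in self.inputs]
        if executed is not None:
            executed.append(self.name)
        return self.func(*args)


def constant(name, value):
    """A node without inputs that always returns the same value."""
    return Node(name, lambda: value)


a = constant('a', 1)
b = constant('b', 2)
c = constant('c', 3)
r1 = Node('r1', lambda x, y: x - y, [a, b])   # r1 = a - b
r2 = Node('r2', lambda x: 2 * x, [r1])        # r2 = 2 * r1
z = Node('z', lambda x, y: x + y, [r2, c])    # z = r2 + c

order = []
result = z.evaluate(order)
print('z =', result)
print('Execution order:', order)
```

Evaluating z forces the evaluation of every node that z depends on, which is the behavior we rely on when we evaluate a tensor in a graph.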


TensorFlow builds this computation graph and uses it to compute the gradients 
accordingly. The individual steps for building and compiling such a computation 
graph in TensorFlow are as follows: 


1. Instantiate a new, empty computation graph. 
2. Add nodes (tensors and operations) to the computation graph. 
3. Execute the graph: 
   1. Start a new session. 
   2. Initialize the variables in the graph. 
   3. Run the computation graph in this session. 


So let's create a graph for evaluating $z = 2 \times (a - b) + c$, as shown in the previous 
figure, where a, b, and c are scalars (single numbers). Here, we define them as 
TensorFlow constants. A graph can be created by calling tf.Graph(), then nodes 
can be added to it as follows: 


>>> g = tf.Graph()
>>>
>>> with g.as_default():
...     a = tf.constant(1, name='a')
...     b = tf.constant(2, name='b')
...     c = tf.constant(3, name='c')
...
...     z = 2*(a-b) + c


In this code, we added nodes to the g graph using with g.as_default(). If we do 
not explicitly create a graph, there is always a default graph, and therefore, all the 
nodes are added to the default graph. In this book, we try to avoid working with the 
default graph for clarity. This approach is especially useful when we are developing 
code in a Jupyter notebook, as we avoid piling up unwanted nodes in the default 
graph by accident. 


A TensorFlow session is an environment in which the operations and tensors of a 
graph can be executed. A session object is created by calling tf.Session, which can 
receive an existing graph (here, g) as an argument, as in tf.Session(graph=g); 
otherwise, it will launch the default graph, which might be empty. 


After launching a graph in a TensorFlow session, we can execute its nodes; that is, 
evaluating its tensors or executing its operators. Evaluating each individual tensor 
involves calling its eval method inside the current session. When evaluating a 
specific tensor in the graph, TensorFlow has to execute all the preceding nodes in the 
graph until it reaches that particular one. In case there are one or more placeholders, 
they would need to be fed, as we'll see later in the next section. 


Quite similarly, executing operations can be done using their run method inside the 
current session. For instance, a training operator such as the train_op that we'll 
define later in this chapter does not return any tensor. Such an operator can be 
executed as train_op.run(). Furthermore, there is a universal way 
of running both tensors and operators: tf.Session().run(). Using this method, as 
we'll see later on as well, multiple tensors and operators can be placed in a list or 
tuple. As a result, tf.Session().run() will return a list or tuple of the same size. 


Here, we will launch the previous graph in a TensorFlow session and evaluate the 
tensor z as follows: 


>>> with tf.Session(graph=g) as sess:
...     print('2*(a-b)+c => ', sess.run(z))
2*(a-b)+c =>  1


Remember that we define tensors and operations in a computation graph context 
within TensorFlow. A TensorFlow session is then used to execute the operations in 
the graph and fetch and evaluate the results. 


In this section, we saw how to define a computation graph, how to add nodes to it, 
and how to evaluate the tensors in a graph within a TensorFlow session. We'll now 
take a deeper look into the different types of nodes that can appear in a computation 
graph, including placeholders and variables. Along the way, we'll see some other 
operators that do not return a tensor as the output. 
Placeholders in TensorFlow 


TensorFlow has special mechanisms for feeding data. One of these mechanisms is 
the use of placeholders, which are predefined tensors with specific types and shapes. 


These tensors are added to the computation graph using the tf.placeholder 
function, and they do not contain any data. However, upon the execution of certain 
nodes in the graph, these placeholders need to be fed with data arrays. 


In the following sections, we'll see how to define placeholders in a graph and how to 
feed them with data values upon execution. 
Defining placeholders 


As you now know, placeholders are defined using the tf.placeholder function. 
When we define placeholders, we need to decide what their shape and type should 
be, according to the shape and type of the data that will be fed through them upon 
execution. 


Let's start with a simple example. In the following code, we will define the same 
graph that was shown in the previous section for evaluating $z = 2 \times (a - b) + c$. This 
time, however, we use placeholders for the scalars a, b, and c. Also, we store the 
intermediate tensors associated with $r_1$ and $r_2$, as follows: 


>>> import tensorflow as tf
>>>
>>> g = tf.Graph()
>>> with g.as_default():
...     tf_a = tf.placeholder(tf.int32, shape=[],
...                           name='tf_a')
...     tf_b = tf.placeholder(tf.int32, shape=[],
...                           name='tf_b')
...     tf_c = tf.placeholder(tf.int32, shape=[],
...                           name='tf_c')
...
...     r1 = tf_a - tf_b
...     r2 = 2*r1
...     z = r2 + tf_c


In this code, we defined three placeholders, named tf_a, tf_b, and tf_c, using type 
tf.int32 (32-bit integers) and set their shape via shape=[] since they are scalars 
(tensors of rank 0). In the current book, we always prefix the names of placeholder 
objects with tf_ for clarity and to be able to distinguish them from other tensors. 


Note that in the previous code example, we were dealing with scalars, and therefore, 
their shapes were specified as shape=[]. However, it is very straightforward to 
define placeholders of higher dimensions. For example, a rank 3 placeholder of type 
float and shape 3 x 4 x 5 can be defined as tf.placeholder(dtype=tf.float32, 
shape=[3, 4, 5]). 
Feeding placeholders with data 


When we execute a node in the graph, we need to create a Python dictionary to feed 
the values of placeholders with data arrays. We do this according to the type and 
shape of the placeholders. This dictionary is passed as the input argument feed_dict 
to a session's run method. 


In the previous graph, we added three placeholders of the type tf.int32 to feed 
scalars for computing z. Now, in order to evaluate the result tensor z, we can feed 
arbitrary integer values (here, 1, 2, and 3) to the placeholders, as follows: 


>>> with tf.Session(graph=g) as sess:
...     feed = {tf_a: 1,
...             tf_b: 2,
...             tf_c: 3}
...     print('z:',
...           sess.run(z, feed_dict=feed))


Note that feeding values for placeholders that are not needed to execute a particular 
node does not cause any error; it is just redundant to do so. However, if a placeholder 
is needed for the execution of a particular node and is not provided via the feed_dict 
argument, it will cause a runtime error. 
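
The feeding rule can be mimicked with a small pure-Python analogy. The run helper and the (function, placeholder names) pairs below are hypothetical constructs for illustration only; TensorFlow's own error type and message differ:

```python
def run(fetch, feed_dict):
    """Evaluate a toy node given as (function, names of required placeholders)."""
    func, required = fetch
    missing = [name for name in required if name not in feed_dict]
    if missing:
        raise RuntimeError('placeholders not fed: %s' % ', '.join(missing))
    # Extra entries in feed_dict are simply ignored.
    return func(*(feed_dict[name] for name in required))


# Toy nodes mirroring r1 = a - b and z = 2 * (a - b) + c
r1 = (lambda a, b: a - b, ('tf_a', 'tf_b'))
z = (lambda a, b, c: 2 * (a - b) + c, ('tf_a', 'tf_b', 'tf_c'))

feed = {'tf_a': 1, 'tf_b': 2, 'tf_c': 3}
r1_value = run(r1, feed)      # tf_c is fed but not needed: redundant, no error
z_value = run(z, feed)

try:
    run(z, {'tf_a': 1, 'tf_b': 2})   # tf_c is needed but missing
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(r1_value, z_value, error_message)
```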
Defining placeholders for data arrays with varying batch sizes 


Sometimes, when we are developing a neural network model, we may deal with 
mini-batches of data that have different sizes. For example, we may train a neural 
network with a specific mini-batch size, but we want to use the network to make 
predictions on one or more data inputs. 


A useful feature of placeholders is that we can specify None for the dimension that is 
varying in size. For example, we can create a placeholder of rank 2, where the first 
dimension is unknown (or may vary), as shown here: 


>>> import tensorflow as tf
>>>
>>> g = tf.Graph()
>>>
>>> with g.as_default():
...     tf_x = tf.placeholder(tf.float32,
...                           shape=[None, 2],
...                           name='tf_x')
...
...     x_mean = tf.reduce_mean(tf_x,
...                             axis=0,
...                             name='mean')


Then, we can evaluate x_mean with two different inputs, x1 and x2, which are NumPy 
arrays of shape (5, 2) and (10, 2), as follows: 


>>> import numpy as np
>>> np.random.seed(123)
>>> np.set_printoptions(precision=2)
>>> with tf.Session(graph=g) as sess:
...     x1 = np.random.uniform(low=0, high=1,
...                            size=(5, 2))
...     print('Feeding data with shape', x1.shape)
...     print('Result:', sess.run(x_mean,
...                               feed_dict={tf_x: x1}))
...     x2 = np.random.uniform(low=0, high=1,
...                            size=(10, 2))
...     print('Feeding data with shape', x2.shape)
...     print('Result:', sess.run(x_mean,
...                               feed_dict={tf_x: x2}))


This prints the following output: 


Feeding data with shape (5, 2) 


Result: [ 0.62 0.47] 
Feeding data with shape (10, 2) 
Result: [ 0.46 0.49] 


Lastly, if we try printing the object tf_x, we will get Tensor("tf_x:0", shape=(?, 
2), dtype=float32), which shows that the shape of this tensor is (?, 2). 
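
The same idea can be sketched in plain NumPy: a function that accepts any number of rows but insists on exactly two columns. The column_mean helper is a hypothetical name used only here, and the shape check is an assumption about how one might mirror shape=[None, 2] outside of TensorFlow:

```python
import numpy as np


def column_mean(batch):
    """Average over the first (batch) dimension of an array of shape (None, 2)."""
    batch = np.asarray(batch, dtype=np.float32)
    # Mirror shape=[None, 2]: any number of rows, exactly two columns.
    if batch.ndim != 2 or batch.shape[1] != 2:
        raise ValueError('expected shape (None, 2), got %s' % (batch.shape,))
    return batch.mean(axis=0)


rng = np.random.RandomState(123)
mean_small = column_mean(rng.uniform(low=0, high=1, size=(5, 2)))
mean_large = column_mean(rng.uniform(low=0, high=1, size=(10, 2)))
print(mean_small.shape, mean_large.shape)
```

Regardless of the batch size, the result always has one entry per column, because only the first dimension is reduced.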
Variables in TensorFlow 


In the context of TensorFlow, variables are a special type of tensor object that allows 
us to store and update the parameters of our models in a TensorFlow session during 
training. The following sections explain how we can define variables in a graph, 
initialize those variables in a session, organize variables via the so-called variable 
scope, and reuse existing variables. 
Defining variables 


TensorFlow variables store the parameters of a model that can be updated during 
training, for example, the weights in the input, hidden, and output layers of a neural 
network. When we define a variable, we need to initialize it with a tensor of values. 
Feel free to read more about TensorFlow variables at 
https://www.tensorflow.org/programmers_guide/variables. 


TensorFlow provides two ways for defining variables: 


• tf.Variable(<initial-value>, name="variable-name") 
• tf.get_variable(name, ...) 


The first one, tf.Variable, is a class that creates an object for a new variable and 
adds it to the graph. Note that tf.Variable does not have an explicit way to 
determine shape and dtype; the shape and type are set to be the same as those of the 
initial values. 


The second option, tf.get_variable, can be used to reuse an existing variable with 
a given name (if the name exists in the graph) or create a new one if the name does 
not exist. In this case, the name becomes critical; that's probably why it has to be 
placed as the first argument to this function. Furthermore, tf.get_variable 
provides an explicit way to set shape and dtype; these parameters are only required 
when creating a new variable, not when reusing existing ones. 


The advantage of tf.get_variable over tf.Variable is twofold: tf.get_variable 
allows us to reuse existing variables, and it already uses the popular Xavier/Glorot 
initialization scheme by default. 


Besides the initializer, the get_variable function provides other parameters to 
control the tensor, such as adding a regularizer for the variable. If you are interested 
in learning more about these parameters, feel free to read the documentation of 
tf.get_variable at https://www.tensorflow.org/api_docs/python/tf/get_variable. 
Note 


Xavier (or Glorot) initialization 


In the early development of deep learning, it was observed that random uniform or 
random normal weight initialization could often result in a poor performance of the 
model during training. 


In 2010, Xavier Glorot and Yoshua Bengio investigated the effect of initialization 
and proposed a novel, more robust initialization scheme to facilitate the training of 
deep networks. 


The general idea behind Xavier initialization is to roughly balance the variance of 
the gradients across different layers. Otherwise, one layer may get too much 
attention during training while the other layer lags behind. 


According to the research paper by Glorot and Bengio, if we want to initialize the 
weights from a uniform distribution, we should choose the interval of this uniform 
distribution as follows: 


$$W \sim \text{Uniform}\left(-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right)$$


Here, $n_{in}$ is the number of input neurons that are multiplied with the weights, and 
$n_{out}$ is the number of output neurons that feed into the next layer. For initializing the 
weights from a Gaussian (normal) distribution, the authors recommend choosing the 
standard deviation of this Gaussian to be $\sigma = \frac{\sqrt{2}}{\sqrt{n_{in} + n_{out}}}$. 


TensorFlow supports Xavier initialization in both uniform and normal distributions 
of weights. The documentation provides detailed information about using Xavier 
initialization with TensorFlow: 
https://www.tensorflow.org/api_docs/python/tf/contrib/layers/xavier_initializer. 





For more information about Glorot and Bengio's initialization scheme, including the 
mathematical derivation and proof, read their original paper (Understanding the 
difficulty of deep feedforward neural networks, Xavier Glorot and Yoshua Bengio, 
2010), which is freely available at 
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf. 
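
The two Xavier formulas in this note can be sketched directly in NumPy. This is only an illustration of the interval $\pm\sqrt{6}/\sqrt{n_{in} + n_{out}}$ and the standard deviation $\sqrt{2}/\sqrt{n_{in} + n_{out}}$; the function names xavier_uniform and xavier_normal and the layer sizes are assumptions made for this example and are not TensorFlow's API:

```python
import numpy as np


def xavier_uniform(n_in, n_out, rng):
    """Sample weights uniformly from [-limit, limit], limit = sqrt(6)/sqrt(n_in+n_out)."""
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(low=-limit, high=limit, size=(n_in, n_out))


def xavier_normal(n_in, n_out, rng):
    """Sample weights from a Gaussian with stddev = sqrt(2)/sqrt(n_in+n_out)."""
    stddev = np.sqrt(2.0) / np.sqrt(n_in + n_out)
    return rng.normal(loc=0.0, scale=stddev, size=(n_in, n_out))


rng = np.random.RandomState(1)
n_in, n_out = 100, 50          # illustrative layer sizes
w_uniform = xavier_uniform(n_in, n_out, rng)
w_normal = xavier_normal(n_in, n_out, rng)

limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
target_std = np.sqrt(2.0) / np.sqrt(n_in + n_out)
print('Uniform weights within the interval:', np.abs(w_uniform).max() <= limit)
print('Target stddev: %.4f' % target_std)
```

Both variants aim at the same weight variance, $2 / (n_{in} + n_{out})$, since a uniform distribution on $[-l, l]$ has variance $l^2 / 3$.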


In either initialization technique, it's important to note that the initial values are not 
set until we launch the graph in tf.Session and explicitly run the initializer operator 
in that session. In fact, the required memory for a graph is not allocated until we 
initialize the variables in a TensorFlow session. 


Here is an example of creating a variable object where the initial values are created 
from a NumPy array. The dtype data type of this tensor is tf.int64, which is 
automatically inferred from its NumPy array input: 


>>> import tensorflow as tf
>>> import numpy as np
>>>
>>> g1 = tf.Graph()
>>>
>>> with g1.as_default():
...     w = tf.Variable(np.array([[1, 2, 3, 4],
...                               [5, 6, 7, 8]]), name='w')
...     print(w)
<tf.Variable 'w:0' shape=(2, 4) dtype=int64_ref>
Initializing variables 


Here, it is critical to understand that tensors defined as variables are not allocated in 
memory and contain no values until they are initialized. Therefore, before executing 
any node in the computation graph, we must initialize the variables that are within 
the path to the node that we want to execute. 


This initialization process refers to allocating memory for the associated tensors and 
assigning their initial values. TensorFlow provides a function named 
tf.global_variables_initializer that returns an operator for initializing all the 
variables that exist in a computation graph. Then, executing this operator will 
initialize the variables as follows: 


>>> with tf.Session(graph=g1) as sess:
...     sess.run(tf.global_variables_initializer())
...     print(sess.run(w))


We can also store this operator in an object such as init_op = 
tf.global_variables_initializer() and execute this operator later using 
sess.run(init_op) or init_op.run(). However, we need to make sure that this 
operator is created after we define all the variables. 


For example, in the following code, we define the variable w1, then we define the 
operator init_op, followed by the variable w2: 


>>> import tensorflow as tf
>>>
>>> g2 = tf.Graph()
>>>
>>> with g2.as_default():
...     w1 = tf.Variable(1, name='w1')
...     init_op = tf.global_variables_initializer()
...     w2 = tf.Variable(2, name='w2')


Now, let's evaluate w1 as follows: 


>>> with tf.Session(graph=g2) as sess:
...     sess.run(init_op)
...     print('w1:', sess.run(w1))
w1: 1




This works fine. Now, let's try evaluating w2: 


>>> with tf.Session(graph=g2) as sess:
...     sess.run(init_op)
...     print('w2:', sess.run(w2))
FailedPreconditionError
Attempting to use uninitialized value w2
     [[Node: _retval_w2_0_0 = _Retval[T=DT_INT32, index=0,
_device="/job:localhost/replica:0/task:0/cpu:0"](w2)]]


As shown in the code example, executing the graph raises an error because w2 was 
not initialized via sess.run(init_op), and therefore, couldn't be evaluated. The 
operator init_op was defined prior to adding w2 to the graph; thus, executing 
init_op will not initialize w2. 
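
The ordering issue can be reproduced with a small pure-Python analogy in which the initializer captures only the variables that exist at the moment it is created. The ToyGraph class is a hypothetical stand-in written for this illustration; it is not TensorFlow code and only mimics the behavior described above:

```python
class ToyGraph:
    """Toy stand-in for a graph that tracks variables and their values."""

    def __init__(self):
        self.initial_values = {}   # every variable defined so far
        self.values = {}           # variables that have been initialized

    def variable(self, initial_value, name):
        self.initial_values[name] = initial_value
        return name

    def global_variables_initializer(self):
        # Capture only the variables that exist at this moment.
        captured = dict(self.initial_values)

        def init_op():
            self.values.update(captured)
        return init_op

    def read(self, name):
        if name not in self.values:
            raise RuntimeError('Attempting to use uninitialized value ' + name)
        return self.values[name]


g2 = ToyGraph()
w1 = g2.variable(1, name='w1')
init_op = g2.global_variables_initializer()   # created before w2 exists
w2 = g2.variable(2, name='w2')

init_op()
w1_value = g2.read(w1)
try:
    g2.read(w2)
    w2_error = None
except RuntimeError as err:
    w2_error = str(err)
print(w1_value, w2_error)
```

Creating the initializer only after all variables have been defined avoids the problem, because the captured set then includes w2 as well.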
Variable scope 


In this subsection, we're going to discuss scoping, which is an important concept in 
TensorFlow, and especially useful if we are constructing large neural network 
graphs. 


With variable scopes, we can organize the variables into separate subparts. When we 
create a variable scope, the name of operations and tensors that are created within 
that scope are prefixed with that scope, and those scopes can further be nested. For 
example, if we have two subnetworks, where each subnetwork has several layers, we 
can define two scopes named 'net_A' and 'net_B', respectively. Then, each layer 
will be defined within one of these scopes. 


Let's see how the variable names will turn out in the following code example: 


>>> import tensorflow as tf
>>>
>>> g = tf.Graph()
>>>
>>> with g.as_default():
...     with tf.variable_scope('net_A'):
...         with tf.variable_scope('layer-1'):
...             w1 = tf.Variable(tf.random_normal(
...                 shape=(10,4)), name='weights')
...         with tf.variable_scope('layer-2'):
...             w2 = tf.Variable(tf.random_normal(
...                 shape=(20,10)), name='weights')
...     with tf.variable_scope('net_B'):
...         with tf.variable_scope('layer-1'):
...             w3 = tf.Variable(tf.random_normal(
...                 shape=(10,4)), name='weights')
...
...     print(w1)
...     print(w2)
...     print(w3)
<tf.Variable 'net_A/layer-1/weights:0' shape=(10, 4) dtype=float32_ref>
<tf.Variable 'net_A/layer-2/weights:0' shape=(20, 10) dtype=float32_ref>
<tf.Variable 'net_B/layer-1/weights:0' shape=(10, 4) dtype=float32_ref>


Notice that the variable names are now prefixed with their nested scopes, separated 
by the forward slash (/) symbol. 
Note 


For more information about variable scoping, read the documentation at 
https://www.tensorflow.org/programmers_guide/variable_scope and 
https://www.tensorflow.org/api_docs/python/tf/variable_scope. 
Reusing variables 


Let's imagine that we're developing a somewhat complex neural network model that 
has a classifier whose input data comes from more than one source. For example, 
we'll assume that we have data $(X_A, y_A)$ coming from source A and data $(X_B, y_B)$ 
coming from source B. In this example, we will design our graph in such a way that it 
will use the data from only one source as input tensor to build the network. Then, we 
can feed the data from the other source to the same classifier. 


In the following example, we assume that data from source A is fed through a 
placeholder, and source B is the output of a generator network. We will build the 
generator network by calling the build_generator function within the generator scope, 
then we will add a classifier by calling build_classifier within the classifier scope: 


>>> import tensorflow as tf
>>>
>>> ###########################
>>> ##   Helper functions    ##
>>> ###########################
>>>
>>> def build_classifier(data, labels, n_classes=2):
...     data_shape = data.get_shape().as_list()
...     weights = tf.get_variable(name='weights',
...                               shape=(data_shape[1],
...                                      n_classes),
...                               dtype=tf.float32)
...     bias = tf.get_variable(name='bias',
...                            initializer=tf.zeros(
...                                shape=n_classes))
...     logits = tf.add(tf.matmul(data, weights),
...                     bias,
...                     name='logits')
...     return logits, tf.nn.softmax(logits)
>>>
>>>
>>> def build_generator(data, n_hidden):
...     data_shape = data.get_shape().as_list()
...     w1 = tf.Variable(
...         tf.random_normal(shape=(data_shape[1],
...                                 n_hidden)),
...         name='w1')
...     b1 = tf.Variable(tf.zeros(shape=n_hidden),
...                      name='b1')
...     hidden = tf.add(tf.matmul(data, w1), b1,
...                     name='hidden_pre-activation')
...     hidden = tf.nn.relu(hidden, 'hidden_activation')
...
...     w2 = tf.Variable(
...         tf.random_normal(shape=(n_hidden,
...                                 data_shape[1])),
...         name='w2')
...     b2 = tf.Variable(tf.zeros(shape=data_shape[1]),
...                      name='b2')
...     output = tf.add(tf.matmul(hidden, w2), b2,
...                     name='output')
...     return output, tf.nn.sigmoid(output)
>>>
>>> ###########################
>>> ##   Build the graph     ##
>>> ###########################
>>>
>>> batch_size = 64
>>> g = tf.Graph()
>>>
>>> with g.as_default():
...     tf_X = tf.placeholder(shape=(batch_size, 100),
...                           dtype=tf.float32,
...                           name='tf_X')
...
...     ## build the generator
...     with tf.variable_scope('generator'):
...         gen_out1 = build_generator(data=tf_X,
...                                    n_hidden=50)
...
...     ## build the classifier
...     with tf.variable_scope('classifier') as scope:
...         ## classifier for the original data:
...         cls_out1 = build_classifier(data=tf_X,
...                                     labels=tf.ones(
...                                         shape=batch_size))
...
...         ## reuse the classifier for generated data
...         scope.reuse_variables()
...         cls_out2 = build_classifier(data=gen_out1[1],
...                                     labels=tf.zeros(
...                                         shape=batch_size))


Notice that we have called the build_classifier function two times. The first call 
causes the building of the network. Then, we call scope.reuse_variables() and 
call that function again. As a result, the second call does not create new variables; 
instead, it reuses the same variables. Alternatively, we could reuse the variables by 
specifying the reuse=True parameter, as follows: 


>>> g = tf.Graph()
>>>
>>> with g.as_default():
...     tf_X = tf.placeholder(shape=(batch_size, 100),
...                           dtype=tf.float32,
...                           name='tf_X')
...     ## build the generator
...     with tf.variable_scope('generator'):
...         gen_out1 = build_generator(data=tf_X,
...                                    n_hidden=50)
...
...     ## build the classifier
...     with tf.variable_scope('classifier'):
...         ## classifier for the original data:
...         cls_out1 = build_classifier(data=tf_X,
...                                     labels=tf.ones(
...                                         shape=batch_size))
...
...     with tf.variable_scope('classifier', reuse=True):
...         ## reuse the classifier for generated data
...         cls_out2 = build_classifier(data=gen_out1[1],
...                                     labels=tf.zeros(
...                                         shape=batch_size))


Note 


While we have discussed how to define computational graphs and variables in 
TensorFlow, a detailed discussion of how we can compute gradients in a 
computational graph is beyond the scope of this book, where we use TensorFlow's 
convenient optimizer classes that perform backpropagation automatically for us. If 
you are interested in learning more about the computation of gradients in 
computational graphs and the different ways to compute them in TensorFlow, please 
refer to the PyData talk by Sebastian Raschka at 
https://github.com/rasbt/pydata-annarbor2017-dl-tutorial. 
Building a regression model 


Since we've explored placeholders and variables, let's build an example model for 
regression analysis, similar to the one we created in Chapter 13, Parallelizing Neural 
Network Training with TensorFlow, where our goal is to implement a linear 
regression model: $\hat{y} = wx + b$. 


In this model, w and b are the two parameters of this simple regression model that 
need to be defined as variables. Note that x is the input to the model, which we can 
define as a placeholder. Furthermore, recall that for training this model, we need to 
formulate a cost function. Here, we use the Mean Squared Error (MSE) cost 
function that we defined in Chapter 10, Predicting Continuous Target Variables with 
Regression Analysis. 


Here, y is the true value, which is given as the input to this model for training. 
Therefore, we need to define y as a placeholder as well. Finally, $\hat{y}$ is the prediction 
output, which will be computed using TensorFlow operations such as tf.matmul and 
tf.add. Recall that TensorFlow operations return zero or more tensors; here, 
tf.matmul and tf.add return one tensor. 


We can also use the overloaded operator + for adding two tensors; however, the 
advantage of tf.add is that we can provide an additional name for the resulting 
tensor via the name parameter. 


So, let's summarize all our tensors with their mathematical notations and coding 
naming, as follows: 


• Input x: tf_x defined as a placeholder 
• Input y: tf_y defined as a placeholder 
• Model parameter w: weight defined as a variable 
• Model parameter b: bias defined as a variable 
• Model output $\hat{y}$: y_hat returned by the TensorFlow operations to compute the 
prediction using the regression model 
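
Before turning to TensorFlow, it can help to see the same quantities in plain NumPy. The following sketch fits $\hat{y} = wx + b$ by gradient descent on the MSE, $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$, with hand-derived gradients. The toy data (slope 1.5, intercept -0.5, noise level), the learning rate, and the number of epochs are assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(low=-2, high=4, size=100)
# Illustrative data: the slope 1.5, intercept -0.5, and noise level are assumptions.
y = 1.5 * x - 0.5 + rng.normal(scale=0.5, size=100)

weight, bias = 0.0, 0.0          # model parameters w and b
learning_rate = 0.01
costs = []
for epoch in range(200):
    y_hat = weight * x + bias              # model output
    error = y - y_hat
    costs.append(np.mean(error ** 2))      # MSE cost
    # Gradients of the MSE with respect to w and b.
    grad_w = -2.0 * np.mean(error * x)
    grad_b = -2.0 * np.mean(error)
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

print('first cost: %.4f, last cost: %.4f' % (costs[0], costs[-1]))
```

In the TensorFlow version, the two gradient lines and the parameter updates are what an optimizer's minimize operation takes care of for us.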


The code to implement this simple regression model is as follows: 


>>> import tensorflow as tf
>>> import numpy as np
>>>
>>> g = tf.Graph()
>>>
>>> with g.as_default():
...     tf.set_random_seed(123)
...     ## placeholders
...     tf_x = tf.placeholder(shape=(None),
...                           dtype=tf.float32,
...                           name='tf_x')
...     tf_y = tf.placeholder(shape=(None),
...                           dtype=tf.float32,
...                           name='tf_y')
...
...     ## define the variable (model parameters)
...     weight = tf.Variable(
...         tf.random_normal(
...             shape=(1, 1),
...             stddev=0.25),
...         name='weight')
...     bias = tf.Variable(0.0, name='bias')
...
...     ## build the model
...     y_hat = tf.add(weight * tf_x, bias,
...                    name='y_hat')
...
...     ## compute the cost
...     cost = tf.reduce_mean(tf.square(tf_y - y_hat),
...                           name='cost')
...
...     ## train the model
...     optim = tf.train.GradientDescentOptimizer(
...         learning_rate=0.001)
...     train_op = optim.minimize(cost, name='train_op')


Now that we've built the graph, our next steps are to create a session to launch the 
graph and train the model. But before we go further, let's see how we can evaluate 
tensors and execute operations. We'll create a random regression dataset with one 
feature, using the make_random_data function, and visualize the data: 


>>> ## create a random toy dataset for regression
>>>
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> np.random.seed(0)
>>>
>>> def make_random_data():
...     x = np.random.uniform(low=-2, high=4, size=200)
...     y = []
...     for t in x:
...         r = np.random.normal(loc=0.0,
...                              scale=(0.5 + t*t/3),
...                              size=None)
...         y.append(r)
...     return x, 1.726*x - 0.84 + np.array(y)
>>>
>>>
>>> x, y = make_random_data()
>>>
>>> plt.plot(x, y, 'o')
>>> plt.show()


The following figure shows the random regression data that we generated: 





Now we're ready; let's train the previous model. Let's start by creating a TensorFlow 
session object called sess. Then, we want to initialize our variables which, as we 
saw, we can do with sess.run(tf.global_variables_initializer()). After this, 
we can create a for loop to execute the train operator and calculate the training cost 
at the same time. 


So let's combine the two tasks, the first to execute an operator, and the second to 
evaluate a tensor, into one sess.run method call. The code for this is as follows: 


>>> ## train/test splits
>>> x_train, y_train = x[:100], y[:100]
>>> x_test, y_test = x[100:], y[100:]
>>>
>>> n_epochs = 500
>>> training_costs = []
>>> with tf.Session(graph=g) as sess:
...     sess.run(tf.global_variables_initializer())
...
...     ## train the model for n_epochs
...     for e in range(n_epochs):
...         c, _ = sess.run([cost, train_op],
...                         feed_dict={tf_x: x_train,
...                                    tf_y: y_train})
...         training_costs.append(c)
...         if not e % 50:
...             print('Epoch %4d: %.4f' % (e, c))
Epoch    0: 12.2230
Epoch   50: 8.3876
Epoch  100: 6.5721
Epoch  150: 5.6844
Epoch  200: 5.2269
Epoch  250: 4.9725
Epoch  300: 4.8169
Epoch  350: 4.7119
Epoch  400: 4.6347
Epoch  450: 4.5742
>>>
>>> plt.plot(training_costs)
>>> plt.show()


The code generates the following graph that shows the training costs after each 
epoch: 
Executing objects in a TensorFlow graph using their names 


Executing variables and operators by their names is very useful in many scenarios. 
For example, we may develop a model in a separate module; and thus the variables 
are not available in a different Python scope according to Python scoping rules. 
However, if we have a graph, we can execute the nodes of the graph using their 
names in the graph. 


This can be done easily by changing the sess.run method from the previous code 
example, using the variable name of the cost in the graph rather than the Python 
variable cost, that is, by changing sess.run([cost, train_op], ...) to 
sess.run(['cost:0', 'train_op'], ...): 


>>> n_ epochs = 500 

Pe eel OSes = J 

>>> with tf.Session(graph=g) as sess: 
## first, run the variables initializer 
SesSe-eTUn(Ur.G bobal Variables 101 eie livery) ) 


## train the model for n_eopchs 
fOr © in, Lange (nn. Spocns): 
Cy, = Ses5.700 ("Costs 0", “Clan oD), 
Peed. Ler. "Er Ue Peay 
‘CE Ve0. sy Crain) ) 
Lielmimg COSts.appena{c) 
1£ et50 == 
print('Epoch {:4d} : {:.4f}' 
»-LOrmMat(e, Cc) ) 


Notice that we are evaluating the cost by its name, which is 'cost:0', and executing
the train operator by its name: 'train_op'. Also, in feed_dict, instead of using
tf_x: x_train, we are using 'tf_x:0': x_train.


Note


If we pay attention to the names of the tensors, we will notice that TensorFlow adds
a suffix ':0' to the name of the tensors.


However, the names of operators do not have any suffix like that. When a tensor
with a given name, such as name='my_tensor', is created, TensorFlow appends
':0'; so the name of this tensor will be 'my_tensor:0'.


Then, if we try to create another tensor with the same name in the same graph,
TensorFlow will append '_1:0' and so on to the name; therefore, the future tensors
will be named 'my_tensor_1:0', 'my_tensor_2:0', and so on. This naming
assumes that we are not trying to reuse the already created tensor.
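

The naming rule described in this note can be mimicked in a few lines of plain Python. This is only a toy sketch, not TensorFlow code: the class ToyGraph and its method unique_tensor_name are hypothetical names made up for illustration, and the sketch assumes that every operation has a single output (hence the ':0' suffix).

```python
from collections import defaultdict


class ToyGraph:
    """Toy stand-in for a graph that hands out unique tensor names."""

    def __init__(self):
        # how many times each base name has been requested so far
        self._counts = defaultdict(int)

    def unique_tensor_name(self, name):
        n_used = self._counts[name]
        self._counts[name] += 1
        # the first use keeps the name; later uses get _1, _2, ...
        op_name = name if n_used == 0 else '{}_{}'.format(name, n_used)
        # ':0' refers to the first (here: only) output of the operation
        return op_name + ':0'


g = ToyGraph()
names = [g.unique_tensor_name('my_tensor') for _ in range(3)]
print(names)  # ['my_tensor:0', 'my_tensor_1:0', 'my_tensor_2:0']
```

The sketch only covers the case discussed above, where a tensor is newly created rather than reused.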


Saving and restoring a model in TensorFlow


In the previous section, we built a graph and trained it. How about doing the actual
prediction on the held-out test set? The problem is that we did not save the model
parameters; so, once the execution of the preceding statements is finished and we
exit the tf.Session environment, all the variables and their allocated memory are
freed.


One solution is to train a model and, as soon as the training is finished, feed it
our test set. However, this is not a good approach since deep neural network models
are typically trained over multiple hours, days, or even weeks.


The best approach is to save the trained model for future use. For this purpose, we
need to add a new node to the graph, an instance of the tf.train.Saver class, which
we call saver.


In the following statement, we can add more nodes to a particular graph. In this case,
we are adding saver to the graph g:


>>> with g.as_default():
...     saver = tf.train.Saver()


Next, we can retrain the model with an additional call to saver.save() to save the
model as follows:


>>> n_epochs = 500
>>> training_costs = []
>>> with tf.Session(graph=g) as sess:
...     sess.run(tf.global_variables_initializer())
...
...     ## train the model for n_epochs
...     for e in range(n_epochs):
...         c, _ = sess.run([cost, train_op],
...                         feed_dict={tf_x: x_train,
...                                    tf_y: y_train})
...         training_costs.append(c)
...         if not e % 50:
...             print('Epoch %4d: %.4f' % (e, c))
...
...     saver.save(sess, './trained-model')


As a result of this new statement, three files are created with extensions .data,
.index, and .meta. TensorFlow uses Protocol Buffers
(https://developers.google.com/protocol-buffers/), which is a language-agnostic way
of serializing structured data.


Restoring a trained model requires two steps:


1. Rebuild the graph that has the same nodes and names as the saved model.
2. Restore the saved variables in a new tf.Session environment.


For the first step, we can run the statements, as we did in the first place, to build the
graph g. But there is a much easier way to do this. Note that all of the information
regarding the graph is saved as metadata in the file with the .meta extension. Using
the following code, we rebuild the graph by importing it from the meta file:


>>> with tf.Session() as sess:
...     new_saver = tf.train.import_meta_graph(
...         './trained-model.meta')


The tf.train.import_meta_graph function recreates the graph that is saved in the
'./trained-model.meta' file. After recreating the graph, we can use the new_saver
object to restore the parameters of the model in that session and execute it. The
complete code to run the model on a test set is as follows:


>>> import tensorflow as tf
>>> import numpy as np
>>>
>>> g2 = tf.Graph()
>>> with tf.Session(graph=g2) as sess:
...     new_saver = tf.train.import_meta_graph(
...         './trained-model.meta')
...     new_saver.restore(sess, './trained-model')
...
...     y_pred = sess.run('y_hat:0',
...                       feed_dict={'tf_x:0': x_test})


Note that we evaluated the $\hat{y}$ tensor by its name that was given previously:
'y_hat:0'. Also, we needed to feed the values for the tf_x placeholder, which is
also done by its name: 'tf_x:0'. In this case, there is no need to feed the values for
the true y values. This is because executing the y_hat node does not depend on tf_y
in the computation graph that we built.


Now, let's visualize the predictions, as follows:


>>> import matplotlib.pyplot as plt
>>>
>>> x_arr = np.arange(-2, 4, 0.1)
>>>
>>> g2 = tf.Graph()
>>> with tf.Session(graph=g2) as sess:
...     new_saver = tf.train.import_meta_graph(
...         './trained-model.meta')
...     new_saver.restore(sess, './trained-model')
...
...     y_arr = sess.run('y_hat:0',
...                      feed_dict={'tf_x:0': x_arr})
>>>
>>> plt.figure()
>>> plt.plot(x_train, y_train, 'bo')
>>> plt.plot(x_test, y_test, 'bo', alpha=0.3)
>>> plt.plot(x_arr, y_arr.T[:, 0], '-r', lw=3)
>>> plt.show()


The result is shown in the following figure, where both the training data and test data 
are displayed: 


Saving and restoring a model is also very often used during the training stage of
large models. Since training large models can take several hours to days, we can
break the training phase into smaller tasks. For example, if the intended number of
epochs is 100, we can break it into 25 tasks, where each task runs four epochs one
after the other. For this purpose, we can save the trained model at the end of each
task and restore it in the next task.
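

The idea of splitting 100 epochs into 25 tasks of four epochs each can be sketched without TensorFlow. In the following toy example, the "model" is a single number and the checkpoint is a pickle file in a temporary directory; the helper names run_epochs and run_task, the file name, and the update rule are all assumptions made for illustration, whereas real code would call saver.save and new_saver.restore instead.

```python
import os
import pickle
import tempfile


def run_epochs(state, n_epochs):
    """Stand-in for training: nudges the weight toward 1.0 each epoch."""
    for _ in range(n_epochs):
        state['weight'] += 0.1 * (1.0 - state['weight'])
        state['epoch'] += 1
    return state


def run_task(ckpt_path, epochs_per_task):
    """Restore the last checkpoint (if any), train, and save again."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path, 'rb') as f:
            state = pickle.load(f)
    else:
        state = {'weight': 0.0, 'epoch': 0}
    state = run_epochs(state, epochs_per_task)
    with open(ckpt_path, 'wb') as f:
        pickle.dump(state, f)
    return state


total_epochs, n_tasks = 100, 25
epochs_per_task = total_epochs // n_tasks   # four epochs per task

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, 'toy-model.pkl')
    for _ in range(n_tasks):
        resumed = run_task(ckpt_path, epochs_per_task)

# reference: the same 100 epochs in one uninterrupted run
uninterrupted = run_epochs({'weight': 0.0, 'epoch': 0}, total_epochs)
print(resumed['epoch'], resumed == uninterrupted)
```

Because every task starts from exactly the state that the previous task saved, the resumed run ends in the same state as the uninterrupted one.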


Transforming Tensors as multidimensional data arrays


In this section, we explore a selection of operators that can be used to transform
tensors. Note that some of these operators work very similarly to NumPy array
transformations. However, when we are dealing with tensors with ranks higher than
2, we need to be careful in using such transformations, for example, the transpose of
a tensor.


First, as in NumPy, we can use the attribute arr.shape to get the shape of a NumPy
array. In TensorFlow, we use the get_shape method of a tensor instead:


>>> import tensorflow as tf
>>> import numpy as np
>>>
>>> g = tf.Graph()
>>> with g.as_default():
...     arr = np.array([[1., 2., 3., 3.5],
...                     [4., 5., 6., 6.5],
...                     [7., 8., 9., 9.5]])
...     T1 = tf.constant(arr, name='T1')
...     print(T1)
...     s = T1.get_shape()
...     print('Shape of T1 is', s)
...     T2 = tf.Variable(tf.random_normal(
...         shape=s))
...     print(T2)
...     T3 = tf.Variable(tf.random_normal(
...         shape=(s.as_list()[0],)))
...     print(T3)


The output of the previous code example is as follows:


Tensor("T1:0", shape=(3, 4), dtype=float64)
Shape of T1 is (3, 4)
<tf.Variable 'Variable:0' shape=(3, 4) dtype=float32_ref>
<tf.Variable 'Variable_1:0' shape=(3,) dtype=float32_ref>


Notice that we used s to create T2, but we cannot slice or index s for creating T3.
Therefore, we converted s into a regular Python list by s.as_list() and then used
the usual indexing conventions.


Now, let's see how we can reshape tensors. Recall that in NumPy, we can use
np.reshape or arr.reshape for this purpose. In TensorFlow, we use the function
tf.reshape to reshape a tensor. As is the case for NumPy, one dimension can be set
to -1 so that the size of the new dimension will be inferred based on the total size of
the array and the other remaining dimensions that are specified.
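

How the -1 placeholder is resolved can be written out as a small helper. The function infer_shape below is a hypothetical name, written in plain Python for illustration only; it is not how NumPy or TensorFlow implement reshaping, but it follows the same rule: the missing size is the total number of elements divided by the product of the specified sizes.

```python
def infer_shape(total_size, shape):
    """Replace a single -1 in `shape` so the sizes multiply to total_size."""
    if list(shape).count(-1) > 1:
        raise ValueError('only one dimension can be -1')
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if -1 not in shape:
        if known != total_size:
            raise ValueError('shape does not match the number of elements')
        return tuple(shape)
    if known == 0 or total_size % known != 0:
        raise ValueError('cannot infer the missing dimension')
    return tuple(total_size // known if dim == -1 else dim for dim in shape)


# a 3x4 array has 12 elements
print(infer_shape(12, [1, 1, -1]))  # (1, 1, 12)
print(infer_shape(12, [1, 3, -1]))  # (1, 3, 4)
```

A request such as infer_shape(12, [5, -1]) raises a ValueError, since 12 elements cannot be arranged in rows of five.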


In the following code, we reshape the tensor T1 to T4 and T5, both of which have
rank 3:


>>> with g.as_default():
...     T4 = tf.reshape(T1, shape=[1, 1, -1],
...                     name='T4')
...     print(T4)
...     T5 = tf.reshape(T1, shape=[1, 3, -1],
...                     name='T5')
...     print(T5)


The output is as follows:


Tensor("T4:0", shape=(1, 1, 12), dtype=float64)
Tensor("T5:0", shape=(1, 3, 4), dtype=float64)


Next, let's print the elements of T4 and T5:


>>> with tf.Session(graph = g) as sess:
...     print(sess.run(T4))
...     print()
...     print(sess.run(T5))


[[[ 1.   2.   3.   3.5  4.   5.   6.   6.5  7.   8.   9.   9.5]]]

[[[ 1.   2.   3.   3.5]
  [ 4.   5.   6.   6.5]
  [ 7.   8.   9.   9.5]]]


As we know, there are three ways to transpose an array in NumPy: arr.T,
arr.transpose(), and np.transpose(arr). In TensorFlow, we use the
tf.transpose function instead, and in addition to a regular transpose operation, we
can change the order of dimensions in any way we want by specifying the order in
perm=[...]. Here's an example:


>>> with g.as_default():
...     T6 = tf.transpose(T5, perm=[2, 1, 0],
...                       name='T6')
...     print(T6)
...     T7 = tf.transpose(T5, perm=[0, 2, 1],
...                       name='T7')
...     print(T7)


Tensor("T6:0", shape=(4, 3, 1), dtype=float64)
Tensor("T7:0", shape=(1, 4, 3), dtype=float64)
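

The effect of perm on the shape can be checked by hand: dimension k of the result is dimension perm[k] of the input. The helper transposed_shape below is a hypothetical name in plain Python, meant only to make this rule explicit; it computes shapes and does not move any data.

```python
def transposed_shape(shape, perm):
    """Shape after a transpose: output axis k takes input axis perm[k]."""
    if sorted(perm) != list(range(len(shape))):
        raise ValueError('perm must be a permutation of the axes')
    return tuple(shape[axis] for axis in perm)


t5_shape = (1, 3, 4)
print(transposed_shape(t5_shape, [2, 1, 0]))  # (4, 3, 1)
print(transposed_shape(t5_shape, [0, 2, 1]))  # (1, 4, 3)
```

Both results agree with the shapes that TensorFlow reports for T6 and T7.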


Next, we can also split a tensor into a list of subtensors using the tf.split function,
as follows:


>>> with g.as_default():
...     t5_splt = tf.split(T5,
...                        num_or_size_splits=2,
...                        axis=2, name='T8')
...     print(t5_splt)


[<tf.Tensor 'T8:0' shape=(1, 3, 2) dtype=float64>,
 <tf.Tensor 'T8:1' shape=(1, 3, 2) dtype=float64>]


Here, it's important to note that the output is not a tensor object anymore; rather, it's
a list of tensors. The names of these subtensors are 'T8:0' and 'T8:1'.


Lastly, another useful transformation is the concatenation of multiple tensors. If we
have a list of tensors with the same shape and dtype, we can combine them into one
big tensor using the tf.concat function. An example is given in the following code:


>>> g = tf.Graph()
>>> with g.as_default():
...     t1 = tf.ones(shape=(5, 1),
...                  dtype=tf.float32, name='t1')
...     t2 = tf.zeros(shape=(5, 1),
...                   dtype=tf.float32, name='t2')
...     print(t1)
...     print(t2)
>>> with g.as_default():
...     t3 = tf.concat([t1, t2], axis=0, name='t3')
...     print(t3)
...     t4 = tf.concat([t1, t2], axis=1, name='t4')
...     print(t4)


Tensor("t1:0", shape=(5, 1), dtype=float32)
Tensor("t2:0", shape=(5, 1), dtype=float32)
Tensor("t3:0", shape=(10, 1), dtype=float32)
Tensor("t4:0", shape=(5, 2), dtype=float32)


Let's print the values of these concatenated tensors:


>>> with tf.Session(graph=g) as sess:
...     print(t3.eval())
...     print()
...     print(t4.eval())


[[ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]]

[[ 1.  0.]
 [ 1.  0.]
 [ 1.  0.]
 [ 1.  0.]
 [ 1.  0.]]


Utilizing control flow mechanics in building graphs


Now let's learn about an interesting TensorFlow mechanic. TensorFlow provides a
mechanism for making decisions when building a graph. However, there are some
subtle differences when we use Python's control flow statements compared to
TensorFlow's control flow functions when constructing computation graphs.


To illustrate these differences with some simple code examples, let's consider
implementing the following equation in TensorFlow:


$$res = \begin{cases} x + y & \text{if } x < y \\ x - y & \text{otherwise} \end{cases}$$


In the following code, we may naively use Python's if statement to build a graph
that corresponds to the preceding equation:


>>> import tensorflow as tf
>>>
>>> x, y = 1.0, 2.0
>>>
>>> g = tf.Graph()
>>> with g.as_default():
...     tf_x = tf.placeholder(dtype=tf.float32,
...                           shape=None, name='tf_x')
...     tf_y = tf.placeholder(dtype=tf.float32,
...                           shape=None, name='tf_y')
...     if x < y:
...         res = tf.add(tf_x, tf_y, name='result_add')
...     else:
...         res = tf.subtract(tf_x, tf_y, name='result_sub')
...
...     print('Object:', res)
>>>
>>> with tf.Session(graph=g) as sess:
...     print('x < y: %s -> Result:' % (x < y),
...           res.eval(feed_dict={'tf_x:0': x,
...                               'tf_y:0': y}))
...     x, y = 2.0, 1.0
...     print('x < y: %s -> Result:' % (x < y),
...           res.eval(feed_dict={'tf_x:0': x,
...                               'tf_y:0': y}))


The result of this code will be as follows:


Object: Tensor("result_add:0", dtype=float32)
x < y: True -> Result: 3.0
x < y: False -> Result: 3.0


As you can see, the res object is a tensor named 'result_add:0'. It is very
important to understand that in the previous mechanism, the computation graph has
only one branch associated with the addition operator, and the subtract operator has
not been called.


The TensorFlow computation graph is static, which means that once the computation
graph is built, it remains unchanged during the execution process. So, even when we
change the values of x and y and feed the new values to the graph, these new tensors
will go through the same path in the graph. Therefore, in both cases, we see the same
output 3.0 for x=2, y=1 and for x=1, y=2.


Now, let's use the control flow mechanics in TensorFlow. In the following code, we
implement the previous equation using the tf.cond function instead of Python's if
statement:


>>> import tensorflow as tf
>>>
>>> x, y = 1.0, 2.0
>>>
>>> g = tf.Graph()
>>> with g.as_default():
...     tf_x = tf.placeholder(dtype=tf.float32,
...                           shape=None, name='tf_x')
...     tf_y = tf.placeholder(dtype=tf.float32,
...                           shape=None, name='tf_y')
...     res = tf.cond(tf_x < tf_y,
...                   lambda: tf.add(tf_x, tf_y,
...                                  name='result_add'),
...                   lambda: tf.subtract(tf_x, tf_y,
...                                       name='result_sub'))
...     print('Object:', res)
>>>
>>> with tf.Session(graph=g) as sess:
...     print('x < y: %s -> Result:' % (x < y),
...           res.eval(feed_dict={'tf_x:0': x,
...                               'tf_y:0': y}))
...     x, y = 2.0, 1.0
...     print('x < y: %s -> Result:' % (x < y),
...           res.eval(feed_dict={'tf_x:0': x,
...                               'tf_y:0': y}))


The result will be as follows:


Object: Tensor("cond/Merge:0", dtype=float32)
x < y: True -> Result: 3.0
x < y: False -> Result: 1.0


Here, we can see that the res object is named "cond/Merge:0". In this case, the
computation graph has two branches with a mechanism to decide which branch to
follow at execution time. Therefore, when x=1, y=2, it follows the addition branch
and the output will be 3.0, while for x=2, y=1, the subtraction branch is pursued and
the result will be 1.0.
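

The difference between deciding at graph-construction time and deciding at execution time can be imitated with ordinary Python closures. This is only an analogy, not TensorFlow code: the function names build_with_python_if and build_with_cond are made up for illustration, and the returned lambdas stand in for the graph nodes.

```python
def build_with_python_if(x, y):
    """The branch is picked once, while the 'graph' is being built."""
    if x < y:
        return lambda a, b: a + b      # only this node ends up in the graph
    return lambda a, b: a - b


def build_with_cond():
    """Both branches are kept; the choice is made for each fed value."""
    return lambda a, b: a + b if a < b else a - b


static_res = build_with_python_if(1.0, 2.0)   # built while x=1, y=2
cond_res = build_with_cond()

for x, y in [(1.0, 2.0), (2.0, 1.0)]:
    print(x < y, static_res(x, y), cond_res(x, y))
# True 3.0 3.0
# False 3.0 1.0
```

The static version keeps adding even after the fed values change, which mirrors the two results shown for the Python if statement and for tf.cond.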


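The following figure contrasts the differences in the computation graph of the
previous implementation using the Python if statement versus TensorFlow's
tf.cond function:


[Figure: computation graph built with the Python if statement (a single result_add node) next to the graph built with tf.cond(...) (a cond block)]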


In addition to tf.cond, TensorFlow offers several other control flow operators, such
as tf.case and tf.while_loop. For instance, tf.case is the TensorFlow control
flow equivalent to a Python if...else statement. Consider the following Python
expression:


if (x < y):
    result = 1
else:
    result = 0


The tf.case equivalent to the previous statement for conditional execution in a
TensorFlow graph would then be implemented as follows:


f1 = lambda: tf.constant(1)
f2 = lambda: tf.constant(0)
result = tf.case([(tf.less(x, y), f1)], default=f2)


Similarly, we can add a while loop to a TensorFlow graph that increments the i
variable by 1 until a threshold value (threshold) is reached, as follows:


i = tf.constant(0)
threshold = 100
c = lambda i: tf.less(i, threshold)
b = lambda i: tf.add(i, 1)
r = tf.while_loop(cond=c, body=b, loop_vars=[i])
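

The cond/body/loop_vars pattern may be easier to follow when written as ordinary Python first. The function while_loop below is a hypothetical plain-Python analogue, not the TensorFlow operator: it runs eagerly on Python values, whereas tf.while_loop adds a loop construct to the graph that is executed later in a session.

```python
def while_loop(cond, body, loop_vars):
    """Plain-Python sketch: apply body while cond holds, then return the vars."""
    loop_vars = list(loop_vars)
    while cond(*loop_vars):
        result = body(*loop_vars)
        # a body with a single loop variable may return a bare value
        loop_vars = list(result) if isinstance(result, (list, tuple)) else [result]
    return loop_vars[0] if len(loop_vars) == 1 else tuple(loop_vars)


threshold = 100
final_i = while_loop(cond=lambda i: i < threshold,
                     body=lambda i: i + 1,
                     loop_vars=[0])
print(final_i)  # 100

# two loop variables: a counter and a running sum of the counter values
print(while_loop(lambda i, s: i < 5, lambda i, s: (i + 1, s + i), [0, 0]))  # (5, 10)
```

The condition and the body receive the same loop variables, and the body must return their updated values in the same order.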


You can of course check out the official documentation for more information on the
various control flow operators:
https://www.tensorflow.org/api_guides/python/control_flow_ops.


You may have noticed that these computation graphs are built by TensorBoard, so 
now is a great time to take a good look at TensorBoard in the next section. 


Visualizing the graph with TensorBoard


A great feature of TensorFlow is TensorBoard, which is a module for visualizing the
graph as well as visualizing the learning of a model. Visualizing the graph allows us
to see the connection between nodes, explore their dependencies, and debug the
model if needed.


So let's visualize a network that we've already built, one which consists of a
generator and a classifier part. We'll repeat some code that we previously used for
defining the helper functions. So, revisit the Reusing variables section earlier in this
chapter for the function definitions of build_generator and build_classifier.
Using these two helper functions, we will build the graph as follows:


>>> batch_size=64
>>> g = tf.Graph()
>>>
>>> with g.as_default():
...     tf_X = tf.placeholder(shape=(batch_size, 100),
...                           dtype=tf.float32,
...                           name='tf_X')
...
...     ## build the generator
...     with tf.variable_scope('generator'):
...         gen_out1 = build_generator(data=tf_X,
...                                    n_hidden=50)
...
...     ## build the classifier
...     with tf.variable_scope('classifier') as scope:
...         ## classifier for the original data:
...         cls_out1 = build_classifier(data=tf_X,
...                                     labels=tf.ones(
...                                         shape=batch_size))
...
...         ## reuse the classifier for generated data
...         scope.reuse_variables()
...         cls_out2 = build_classifier(data=gen_out1[1],
...                                     labels=tf.zeros(
...                                         shape=batch_size))


Note that no changes were needed so far for building the graph. So after building the
graph, its visualization is straightforward. The following lines of code export the
graph for visualization purposes:


>>> with tf.Session(graph=g) as sess:
...     sess.run(tf.global_variables_initializer())
...
...     file_writer = tf.summary.FileWriter(
...         logdir='./logs/', graph=g)


This will create a new directory: logs/. Now, we just need to run the following
command in a Linux or macOS Terminal:


tensorboard --logdir logs/


This command will print a message, which is a URL address. You can try launching
TensorBoard by copying the link, for example, http://localhost:6006/#graphs,
and pasting it into your browser's address bar. You should see the graph that
corresponds to this model, as shown in the following figure:


[Figure: TensorBoard graph view showing the generator and classifier subnetworks as two boxes]


The large rectangular boxes indicate the two subnetworks that we built: generator
and classifier. Since we used the tf.variable_scope function when we built this
graph, all the components of each of these subnetworks are grouped into those
rectangular boxes, as shown in the previous figure.


We can expand these boxes to explore their details: using your mouse, click on the
plus sign on the top-right corner of these boxes to expand them. Doing this, we can
see the details of the generator subnetwork, as shown in the following figure:


[Figure: TensorBoard graph view with the generator subnetwork expanded]


By exploring this graph, we can easily see that the generator has two weight tensors,
named w1 and w2. Next, let's expand the classifier subnetwork, as shown in the
following figure:


[Figure: TensorBoard graph view with the classifier subnetwork expanded, showing its inputs from tf_X and from the generator]


As you can see in this figure, the classifier has two sources of input, where one input
comes from the tf_X placeholder and the other one is in fact the output of the
generator subnetwork.


Extending your TensorBoard experience


As an interesting exercise, we suggest you use TensorBoard to visualize the different
graphs we implemented throughout this chapter. For example, you could use similar
steps for building the graphs, and then add extra lines for their visualization. You can
also make graphs for the control flow section, which will show you the difference
between graphs made by the Python if statement and the tf.cond function.


For more information and examples for graph visualization, visit the official
TensorFlow tutorials page at https://www.tensorflow.org/get_started/graph_viz.


Summary


In this chapter, we covered in detail the key features and concepts of TensorFlow. 
We started with discussing TensorFlow's main features and advantages, and key 
TensorFlow concepts such as ranks and tensors. We then looked at TensorFlow's 
computation graphs, and discussed how to launch a graph in a session environment, 
and you learned about placeholders and variables. We then saw different ways to 
evaluate tensors and execute operators, using Python variables or by referring to 
them via their name in the graph. 


We went further to explore some of the essential TensorFlow operators and functions
for transforming tensors, such as tf.transpose, tf.reshape, tf.split, and
tf.concat. Finally, we saw how to visualize a TensorFlow computation graph using
TensorBoard. Visualizing computation graphs using this module can be very useful, 
especially when we are debugging complex models. 


In the next chapter, we'll make use of this library to implement an advanced image 
classifier: a Convolutional Neural Network (CNN). CNNs are powerful models 
and have shown great performance in image classification and computer vision. 
We'll cover the basic operations in CNNs, and we'll implement deep convolutional 
networks for image classification using TensorFlow. 


Chapter 15. Classifying Images with Deep Convolutional Neural Networks


In the previous chapter, we looked in depth at different aspects of the TensorFlow 
API, became familiar with tensors, naming variables, and operators, and learned how 
to work with variable scopes. In this chapter, we'll now learn about Convolutional 
Neural Networks (CNNs), and how we can implement CNNs in TensorFlow. We'll 
also take an interesting journey in this chapter as we apply this type of deep neural 
network architecture to image classification. 


So we'll start by discussing the basic building blocks of CNNs, using a bottom-up 
approach. Then we'll take a deeper dive into the CNN architecture and how to 
implement deep CNNs in TensorFlow. Along the way we'll be covering the 
following topics: 


• Understanding convolution operations in one and two dimensions
• Learning about the building blocks of CNN architectures
• Implementing deep convolutional neural networks in TensorFlow


Building blocks of convolutional neural networks


Convolutional neural networks, or CNNs, are a family of models that were inspired
by how the visual cortex of the human brain works when recognizing objects.


The development of CNNs goes back to the 1990s, when Yann LeCun and his
colleagues proposed a novel neural network architecture for classifying handwritten
digits from images (Handwritten Digit Recognition with a Back-Propagation
Network, Y. LeCun and others, 1989, published at the Neural Information Processing
Systems (NIPS) conference).


Due to the outstanding performance of CNNs for image classification tasks, they
have gained a lot of attention, and this led to tremendous improvements in machine
learning and computer vision applications.


In the following sections, we will see how CNNs are used as feature extraction
engines, and then we'll delve into the theoretical definition of convolution and
computing convolution in one and two dimensions.


Understanding CNNs and learning feature hierarchies


Successfully extracting salient (relevant) features is key to the performance of any 
machine learning algorithm, of course, and traditional machine learning models rely 
on input features that may come from a domain expert, or are based on 
computational feature extraction techniques. Neural networks are able to 
automatically learn the features from raw data that are most useful for a particular 
task. For this reason, it's common to consider a neural network as a feature extraction 
engine: the early layers (those right after the input layer) extract low-level features. 


Multilayer neural networks, and in particular, deep convolutional neural networks,
construct a so-called feature hierarchy by combining the low-level features in a
layer-wise fashion to form high-level features. For example, if we're dealing with
images, then low-level features, such as edges and blobs, are extracted from the
earlier layers, which are combined together to form high-level features, such as
object shapes like a building, a car, or a dog.


As you can see in the following image, a CNN computes feature maps from an 
input image, where each element comes from a local patch of pixels in the input 
image: 





(Photo by Alexander Dummer on Unsplash) 


This local patch of pixels is referred to as the local receptive field. CNNs will
usually perform very well for image-related tasks, and that's largely due to two
important ideas:


• Sparse-connectivity: A single element in the feature map is connected to only a
  small patch of pixels. (This is very different from connecting to the whole input
  image, in the case of perceptrons. You may find it useful to look back and
  compare how we implemented a fully connected network that connected to the
  whole image, in Chapter 12, Implementing a Multilayer Artificial Neural
  Network from Scratch.)

• Parameter-sharing: The same weights are used for different patches of the
  input image.


As a direct consequence of these two ideas, the number of weights (parameters) in
the network decreases dramatically, and we see an improvement in the ability to
capture salient features. Intuitively, it makes sense that nearby pixels are probably
more relevant to each other than pixels that are far away from each other.
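

A quick count shows how large this reduction can be. The numbers below are assumptions chosen purely for illustration (a 28 x 28 single-channel input, a fully connected layer with 100 units, and a convolutional layer with 32 filters of size 5 x 5); the helper names dense_params and conv_params are hypothetical as well.

```python
def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair, plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs


def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """One small kernel per filter, shared by all patches, plus one bias each."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


height, width, channels = 28, 28, 1      # assumed input size
n_dense = dense_params(height * width * channels, 100)
n_conv = conv_params(5, 5, channels, 32)
print(n_dense, n_conv)  # 78500 832
```

The convolutional layer's parameter count does not depend on the image size at all, because the same kernels are reused for every patch.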


Typically, CNNs are composed of several Convolutional (conv) layers and
subsampling (also known as Pooling (P)) layers that are followed by one or more
Fully Connected (FC) layers at the end. The fully connected layers are essentially a
multilayer perceptron, where every input unit $i$ is connected to every output unit $j$
with weight $w_{ij}$ (which we learned about in Chapter 12, Implementing a Multilayer
Artificial Neural Network from Scratch).


Please note that subsampling layers, commonly known as pooling layers, do not 
have any learnable parameters; for instance, there are no weights or bias units in 
pooling layers. However, both convolution and fully connected layers have such 
weights and biases. 


In the following sections, we'll study convolutional and pooling layers in more detail
and see how they work. To understand how convolution operations work, let's start
with a convolution in one dimension before working through the typical
two-dimensional cases as applications for two-dimensional images later.


Performing discrete convolutions


A discrete convolution (or simply convolution) is a fundamental operation in a
CNN. Therefore, it's important to understand how this operation works. In this
section, we'll learn the mathematical definition and discuss some of the naive
algorithms to compute convolutions of two one-dimensional vectors or two
two-dimensional matrices.


Please note that this description is solely for understanding how a convolution 
works. Indeed, much more efficient implementations of convolutional operations 
already exist in packages such as TensorFlow, as we will see later in this chapter. 


Note


Mathematical notation

In this chapter, we will use subscripts to denote the size of a multidimensional array;
for example, $A_{n_1 \times n_2}$ is a two-dimensional array of size $n_1 \times n_2$. We use brackets
to denote the indexing of a multidimensional array. For example, $A[i, j]$ means the
element at index $i, j$ of matrix $A$. Furthermore, note that we use a special symbol
$*$ to denote the convolution operation between two vectors or matrices, which is not
to be confused with the multiplication operator * in Python.


Performing a discrete convolution in one dimension


Let's start with some basic definitions and notations we are going to use. A discrete
convolution for two one-dimensional vectors x and w is denoted by $y = x * w$, in
which vector x is our input (sometimes called signal) and w is called the filter or
kernel. A discrete convolution is mathematically defined as follows:


$$y = x * w \rightarrow y[i] = \sum_{k=-\infty}^{+\infty} x[i - k]\, w[k]$$


Here, the brackets [] are used to denote the indexing for vector elements. The index i
runs through each element of the output vector y. There are two odd things in the
preceding formula that we need to clarify: $-\infty$ to $+\infty$ indices and negative
indexing for x.


Tip
Cross-correlation


Cross-correlation (or simply correlation) between an input vector and a filter is
denoted by $y = x \star w$ and is very much like a sibling for a convolution with a
small difference; the difference is that in cross-correlation, the multiplication is
performed in the same direction. Therefore, it is not required to rotate the filter
matrix w in each dimension. Mathematically, cross-correlation is defined as follows:


$$y = x \star w \rightarrow y[i] = \sum_{k=-\infty}^{+\infty} x[i + k]\, w[k]$$


The same rules for padding and stride may be applied to cross-correlation as well.


The first issue where the sum runs through indices from $-\infty$ to $+\infty$ seems odd
mainly because in machine learning applications, we always deal with finite feature
vectors. For example, if x has 10 features with indices 0, 1, 2, ..., 8, 9, then indices
$-\infty : -1$ and $10 : +\infty$ are out of bounds for x. Therefore, to correctly compute the
summation shown in the preceding formula, it is assumed that x and w are filled with
zeros. This will result in an output vector y that also has infinite size with lots of
zeros as well. Since this is not useful in practical situations, x is padded only with a
finite number of zeros.


This process is called zero-padding or simply padding. Here, the number of zeros
padded on each side is denoted by p. An example padding of a one-dimensional
vector x is shown in the following figure:


[Figure: the original vector x and its zero-padded version, with p zeros added on each side]


Let's assume that the original input x and filter w have n and m elements,
respectively, where $m \le n$. Therefore, the padded vector $x^p$ has size n + 2p.
Then, the practical formula for computing a discrete convolution will change to the
following:


$$y = x * w \rightarrow y[i] = \sum_{k=0}^{k=m-1} x^p[i + m - k]\, w[k]$$

Now that we have solved the infinite index issue, the second issue is indexing x with
i + m - k. The important point to notice here is that x and w are indexed in different
directions in this summation. For this reason, we can flip one of those vectors, x or
w, after they are padded. Then, we can simply compute their dot product.


Let's assume we flip the filter w to get the rotated filter $w^r$. Then, the dot product
$x[i : i + m] \cdot w^r$ is computed to get one element $y[i]$, where $x[i : i + m]$ is a
patch of x with size m.


This operation is repeated like in a sliding window approach to get all the output
elements. The following figure provides an example with x = (3,2,1,7,1,2,5,4) and
a four-element filter w:


[Figure: sliding-window computation of the output elements. Step 2: for each output element i, compute the dot product x[i:i+4] · w^r (move filter by 2 cells)]
You can see in the preceding example that the padding size is zero (p = 0). Notice 
that the rotated filter $w^r$ is shifted by two cells each time we shift. This shift is 
another hyperparameter of a convolution, the stride s. In this example, the stride is 
two, s = 2. Note that the stride has to be a positive number smaller than the size of 
the input vector. We'll talk more about padding and strides in the next section! 
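The sliding-window procedure with a stride can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the input vector is the one from the text, while the four filter values and the helper name `sliding_dot` are hypothetical.

```python
import numpy as np

def sliding_dot(x, w, s=1):
    """Dot product of the flipped filter with each patch of x (no padding)."""
    x = np.asarray(x, dtype=float)
    w_rot = np.asarray(w, dtype=float)[::-1]   # flip (rotate) the filter
    m = len(w_rot)
    # Patch start positions 0, s, 2s, ... as long as the patch still fits
    starts = range(0, len(x) - m + 1, s)
    return np.array([np.dot(x[i:i + m], w_rot) for i in starts])

x = [3, 2, 1, 7, 1, 2, 5, 4]      # input vector from the text
w = [1.0, 0.5, 0.25, 0.75]        # hypothetical four-element filter
y = sliding_dot(x, w, s=2)
print('Number of output elements:', len(y))
```

With n = 8, m = 4, p = 0, and s = 2, the window fits at the start positions 0, 2, and 4, so three output elements are produced.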


The effect of zero-padding in a convolution 


So far, we've used zero-padding in convolutions to compute finite-sized output 
vectors. Technically, padding can be applied with any $p \ge 0$. Depending on the 
choice of p, boundary cells may be treated differently than the cells located in the 
middle of x. 


Now consider an example where n = 5, m = 3. Then, with p = 0, x[0] is only used in 
computing one output element (for instance, y[0]), while x[1] is used in the 
computation of two output elements (for instance, y[0] and y[1]). So, you can see 
that this different treatment of elements of x can artificially put more emphasis on the 
middle element, x[2], since it has appeared in most computations. We can avoid this 
issue if we choose p = 2, in which case, each element of x will be involved in 
computing three elements of y. 
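The counting argument above can be checked with a short sketch. The helper `usage_counts` is a hypothetical name introduced here; it assumes a stride of 1 and simply counts, for every element of the unpadded vector, how many sliding windows cover it.

```python
import numpy as np

def usage_counts(n, m, p):
    """Number of output elements each original input element contributes to (stride 1)."""
    counts = np.zeros(n, dtype=int)
    n_out = n + 2 * p - m + 1
    for i in range(n_out):              # window covers padded positions i .. i+m-1
        for j in range(i, i + m):
            orig = j - p                # index in the unpadded vector
            if 0 <= orig < n:
                counts[orig] += 1
    return counts

print(usage_counts(5, 3, 0))   # [1 2 3 2 1]
print(usage_counts(5, 3, 2))   # [3 3 3 3 3]
```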


Furthermore, the size of the output y also depends on the choice of the padding 
strategy we use. There are three modes of padding that are commonly used in 
practice: full, same, and valid: 


• In the full mode, the padding parameter p is set to p = m - 1. Full padding 
increases the dimensions of the output; thus, it is rarely used in convolutional 
neural network architectures. 

• Same padding is usually used if you want to have the size of the output the 
same as the input vector x. In this case, the padding parameter p is computed 
according to the filter size, along with the requirement that the input size and 
output size are the same. 

• Finally, computing a convolution in the valid mode refers to the case where p = 
0 (no padding). 
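As a rough illustration of the three modes, `numpy.convolve` accepts `mode='full'`, `mode='same'`, and `mode='valid'`. The input and filter values below are arbitrary and chosen only to show the resulting output sizes.

```python
import numpy as np

x = np.arange(1, 6)            # n = 5 (arbitrary example values)
w = np.array([1, 0, -1])       # m = 3 (arbitrary example values)

sizes = {}
for mode in ('full', 'same', 'valid'):
    sizes[mode] = len(np.convolve(x, w, mode=mode))
    print(mode, sizes[mode])
# full -> n + m - 1 = 7, same -> n = 5, valid -> n - m + 1 = 3
```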


The following figure illustrates the three different padding modes for a simple 5 x 5 
pixel input with a kernel size of 3 x 3 and a stride of 1: 


[Figure: full padding, same padding, and valid padding applied to the input image with the kernel (filter), together with the output image assuming stride = 1]

The most commonly used padding mode in convolutional neural networks is same 
padding. One of its advantages over the other padding modes is that same padding 
preserves the height and width of the input images or tensors, which makes 
designing a network architecture more convenient. 


One big disadvantage of valid padding versus full and same padding, for 
example, is that the volume of the tensors would decrease substantially in neural 
networks with many layers, which can be detrimental to the network performance. 


In practice, it is recommended that you preserve the spatial size using same padding 
for the convolutional layers and decrease the spatial size via pooling layers instead. 
As for full padding, it results in an output larger than the input size. Full 
padding is usually used in signal processing applications where it is important to 
minimize boundary effects. However, in a deep learning context, the boundary effect is 
not usually an issue, so we rarely see full padding. 


Determining the size of the convolution output 


The output size of a convolution is determined by the total number of times that we 
shift the filter w along the input vector. Let's assume that the input vector has size n 
and the filter is of size m. Then, the size of the output resulting from $x * w$ with 
padding p and stride s is determined as follows: 

$$o = \left\lfloor \frac{n + 2p - m}{s} \right\rfloor + 1$$

Here, $\lfloor \cdot \rfloor$ denotes the floor operation: 

Tip 

The floor operation returns the largest integer that is equal to or smaller than the input, for 
example: 

$$\text{floor}(1.77) = \lfloor 1.77 \rfloor = 1$$


Consider the following two cases: 


• Compute the output size for an input vector of size 10 with a convolution kernel 
of size 5, padding 2, and stride 1: 

$$n = 10,\ m = 5,\ p = 2,\ s = 1 \rightarrow o = \left\lfloor \frac{10 + 2 \times 2 - 5}{1} \right\rfloor + 1 = 10$$

(Note that in this case, the output size turns out to be the same as the input; 
therefore, we conclude this as mode='same') 

• How can the output size change for the same input vector, but have a kernel of 
size 3, and stride 2? 

$$n = 10,\ m = 3,\ p = 2,\ s = 2 \rightarrow o = \left\lfloor \frac{10 + 2 \times 2 - 3}{2} \right\rfloor + 1 = 6$$


If you are interested in learning more about the size of the convolution output, we 
recommend the manuscript A guide to convolution arithmetic for deep learning, 
Vincent Dumoulin and Francesco Visin, 2016, which is freely available at 
https://arxiv.org/abs/1603.07285. 
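The output-size formula is easy to turn into a small helper. The function name `conv_output_size` is an assumption made for this sketch; integer floor division plays the role of the floor operation.

```python
def conv_output_size(n, m, p=0, s=1):
    """Output size o = floor((n + 2p - m) / s) + 1 of a 1D convolution."""
    return (n + 2 * p - m) // s + 1

print(conv_output_size(n=10, m=5, p=2, s=1))   # 10
print(conv_output_size(n=10, m=3, p=2, s=2))   # 6
```

The two calls reproduce the two cases worked out above.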


Finally, in order to learn how to compute convolutions in one dimension, a naive 
implementation is shown in the following code block, and the results are compared 
with the numpy.convolve function. The code is as follows: 


>>> import numpy as np 
>>> def conv1d(x, w, p=0, s=1): 
...     w_rot = np.array(w[::-1])       # flip (rotate) the filter 
...     x_padded = np.array(x) 
...     if p > 0: 
...         zero_pad = np.zeros(shape=p) 
...         x_padded = np.concatenate([zero_pad, 
...                                    x_padded, 
...                                    zero_pad]) 
...     res = [] 
...     # slide the window in steps of s while it still fits 
...     for i in range(0, len(x_padded) - len(w_rot) + 1, s): 
...         res.append(np.sum(x_padded[i:i+w_rot.shape[0]] * 
...                           w_rot)) 
...     return np.array(res) 
>>> ## Testing: 
>>> x = [1, 3, 2, 4, 5, 6, 1, 3] 
>>> w = [1, 0, 3, 1, 2] 
>>> print('Conv1d Implementation:', 
...       conv1d(x, w, p=2, s=1)) 
Conv1d Implementation: [  5.  14.  16.  26.  24.  34.  19.  22.] 
>>> print('Numpy Results:', 
...       np.convolve(x, w, mode='same')) 
Numpy Results: [ 5 14 16 26 24 34 19 22] 


So far, we have explored the convolution in 1D. We started with the 1D case to 
make the concepts easier to understand. In the next section, we will extend this to 
two dimensions. 


Performing a discrete convolution in 2D 


The concepts you learned in the previous sections are easily extendible to two 
dimensions. When we deal with two-dimensional input, such as a matrix $X_{n_1 \times n_2}$ 
and the filter matrix $W_{m_1 \times m_2}$, where $m_1 \le n_1$ and $m_2 \le n_2$, then the matrix 
$Y = X * W$ is the result of 2D convolution of X with W. This is mathematically 
defined as follows: 

$$Y = X * W \rightarrow Y[i, j] = \sum_{k_1=-\infty}^{+\infty} \sum_{k_2=-\infty}^{+\infty} X[i - k_1, j - k_2]\, W[k_1, k_2]$$


Notice that if you omit one of the dimensions, the remaining formula is exactly the 
same as the one we used previously to compute the convolution in 1D. In fact, all the 
previously mentioned techniques, such as zero-padding, rotating the filter matrix, 
and the use of strides, are also applicable to 2D convolutions, provided that they are 
extended to both the dimensions independently. The following example illustrates 
the computation of a 2D convolution between an input matrix $X_{3 \times 3}$, a kernel matrix 
$W_{3 \times 3}$, padding $p = (1, 1)$, and stride $s = (2, 2)$. According to the specified 
padding, one layer of zeros is padded on each side of the input matrix, which 
results in the padded matrix $X^{padded}_{5 \times 5}$, as follows: 

[Matrices: the kernel W and the padded input $X^{padded}$]

With the preceding filter, the rotated filter will be: 

$$W^r = \begin{bmatrix} 0.5 & 1 & 0.5 \\ 0.1 & 0.4 & 0.3 \\ 0.4 & 0.7 & 0.5 \end{bmatrix}$$
Note that this rotation is not the same as the transpose matrix. To get the rotated 
filter in NumPy, we can write W_rot=W[::-1,::-1]. Next, we can shift the rotated 
filter matrix along the padded input matrix $X^{padded}$ like a sliding window and 
compute the sum of the element-wise product, which is denoted by the $\odot$ operator 
in the following figure: 

[Figure: the rotated filter sliding over $X^{padded}$, with the sum of the element-wise product computed at each position]

The result will be the 2 x 2 matrix Y. 
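The difference between rotating and transposing a filter can be verified directly. In the sketch below, `W_rot_text` holds the rotated filter given above, and `W` is inferred by undoing the rotation (an assumption of this example, since a 180-degree rotation is its own inverse).

```python
import numpy as np

# Rotated filter as given in the text
W_rot_text = np.array([[0.5, 1.0, 0.5],
                       [0.1, 0.4, 0.3],
                       [0.4, 0.7, 0.5]])

# Original filter, recovered here by undoing the 180-degree rotation
W = W_rot_text[::-1, ::-1]

W_rot = W[::-1, ::-1]                  # rotation used for the convolution
print(np.array_equal(W_rot, W_rot_text))        # rotation reproduces the matrix above
print(np.array_equal(W_rot, np.rot90(W, 2)))    # same as two 90-degree rotations
print(np.array_equal(W_rot, W.T))               # but not the same as the transpose
```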


Let's also implement the 2D convolution according to the naive algorithm described. 
The scipy.signal package provides a way to compute 2D convolution via the 
scipy.signal.convolve2d function: 


>>> import numpy as np 
>>> import scipy.signal 


>>> def conv2d(X, W, p=(0, 0), s=(1, 1)): 
...     W_rot = np.array(W)[::-1,::-1]   # rotate the filter by 180 degrees 
...     X_orig = np.array(X) 
...     n1 = X_orig.shape[0] + 2*p[0] 
...     n2 = X_orig.shape[1] + 2*p[1] 
...     X_padded = np.zeros(shape=(n1, n2)) 
...     X_padded[p[0]:p[0]+X_orig.shape[0], 
...              p[1]:p[1]+X_orig.shape[1]] = X_orig 
... 
...     res = [] 
...     # slide the window in steps of s while it still fits 
...     for i in range(0, X_padded.shape[0] - 
...                    W_rot.shape[0] + 1, s[0]): 
...         res.append([]) 
...         for j in range(0, X_padded.shape[1] - 
...                        W_rot.shape[1] + 1, s[1]): 
...             X_sub = X_padded[i:i+W_rot.shape[0], 
...                              j:j+W_rot.shape[1]] 
...             res[-1].append(np.sum(X_sub * W_rot)) 
...     return(np.array(res)) 


>>> X = [[1, 3, 2, 4], [5, 6, 1, 3], [1, 2, 0, 2], [3, 4, 3, 2]] 
>>> W = [[1, 0, 3], [1, 2, 1], [0, 1, 1]] 


>>> print('Conv2d Implementation:\n', 
...       conv2d(X, W, p=(1, 1), s=(1, 1))) 
Conv2d Implementation: 
 [[ 11.  25.  32.  13.] 
 [ 19.  25.  24.  13.] 
 [ 13.  28.  25.  17.] 
 [ 11.  17.  14.   9.]] 


>>> print('SciPy Results:\n', 
...       scipy.signal.convolve2d(X, W, mode='same')) 
SciPy Results: 
 [[11 25 32 13] 
 [19 25 24 13] 
 [13 28 25 17] 
 [11 17 14  9]] 


Tip 


We provided a naive implementation to compute a 2D convolution for the purpose of 
understanding the concepts. However, this implementation is very inefficient in 
terms of memory requirements and computational complexity. Therefore, it should 
not be used in real-world neural network applications. 


In recent years, much more efficient algorithms have been developed that use the 
Fourier transformation for computing convolutions. It is also important to note that 
in the context of neural networks, the size of a convolution kernel is usually much 
smaller than the size of the input image. For example, modern CNNs usually use 
kernel sizes such as 1 x 1, 3 x 3, or 5 x 5, for which efficient algorithms have been 
designed that can carry out the convolutional operations much more efficiently, such 
as Winograd's Minimal Filtering algorithm. These algorithms are beyond the 
scope of this book, but if you are interested in learning more, you can read the 
manuscript Fast Algorithms for Convolutional Neural Networks, Andrew Lavin and 
Scott Gray, 2015, which is freely available at https://arxiv.org/abs/1509.09308. 
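To give a feel for the Fourier-based idea, here is a minimal 1D sketch that relies on the convolution theorem: multiplying the zero-padded spectra of x and w and transforming back yields the full linear convolution. The helper name `fft_convolve_full` is hypothetical, and this is an illustration rather than an optimized routine.

```python
import numpy as np

def fft_convolve_full(x, w):
    """Full 1D convolution computed via the FFT (convolution theorem)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n_out = len(x) + len(w) - 1          # size of the full linear convolution
    X = np.fft.rfft(x, n=n_out)          # zero-pads x to length n_out
    W = np.fft.rfft(w, n=n_out)          # zero-pads w to length n_out
    return np.fft.irfft(X * W, n=n_out)

x = [1, 3, 2, 4, 5, 6, 1, 3]
w = [1, 0, 3, 1, 2]
print(np.round(fft_convolve_full(x, w), 6))
print(np.convolve(x, w, mode='full'))
```

Both calls should agree up to floating-point rounding.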


In the next section, we will discuss subsampling, which is another important 
operation often used in CNNs. 
Subsampling 


Subsampling is typically applied in two forms of pooling operations in convolutional 
neural networks: max-pooling and mean-pooling (also known as average-pooling). 


The pooling layer is usually denoted by $P_{n_1 \times n_2}$. Here, the subscript determines the 
size of the neighborhood (the number of adjacent pixels in each dimension), where 
the max or mean operation is performed. We refer to such a neighborhood as the 
pooling size. 


The operation is described in the following figure. Here, max-pooling takes the 
maximum value from a neighborhood of pixels, and mean-pooling computes their 
average: 

[Figure: max-pooling and mean-pooling applied to an input feature map]

The advantage of pooling is twofold: 


• Pooling (max-pooling) introduces some sort of local invariance. This means 
that small changes in a local neighborhood do not change the result of max-pooling. 
Therefore, it helps generate features that are more robust to noise in the 
input data. See the following example that shows max-pooling of two different 
input matrices $X_1$ and $X_2$ results in the same output: 

$$X_1 = \begin{bmatrix} 10 & 255 & 125 & 0 & 170 & 100 \\ 70 & 255 & 105 & 25 & 25 & 70 \\ 255 & 0 & 150 & 0 & 10 & 10 \\ 0 & 255 & 10 & 10 & 150 & 20 \\ 70 & 15 & 200 & 100 & 95 & 0 \\ 35 & 25 & 100 & 20 & 0 & 60 \end{bmatrix}$$

$$X_2 = \begin{bmatrix} 100 & 100 & 100 & 50 & 100 & 50 \\ 95 & 255 & 100 & 125 & 125 & 170 \\ 80 & 40 & 10 & 10 & 125 & 150 \\ 255 & 30 & 150 & 20 & 120 & 125 \\ 30 & 30 & 150 & 100 & 70 & 70 \\ 70 & 30 & 100 & 200 & 70 & 95 \end{bmatrix}$$

$$\text{max-pooling } P_{2 \times 2}: \quad \begin{bmatrix} 255 & 125 & 170 \\ 255 & 150 & 150 \\ 70 & 200 & 95 \end{bmatrix}$$

• Pooling decreases the size of features, which results in higher computational 
efficiency. Furthermore, reducing the number of features may reduce the degree 
of overfitting as well. 
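Nonoverlapping pooling can be sketched with a reshape in NumPy. The helper `pool2d` and the perturbed copy `X1_noisy` are hypothetical names used only for this illustration; `X1` is the first input matrix from the example above.

```python
import numpy as np

def pool2d(X, size=(2, 2), mode='max'):
    """Nonoverlapping pooling (stride equal to the pooling size)."""
    X = np.asarray(X, dtype=float)
    h, w = size
    n1, n2 = X.shape
    # Drop trailing rows/columns that do not fill a complete window
    X = X[:n1 - n1 % h, :n2 - n2 % w]
    # blocks[i, a, j, b] == X[i*h + a, j*w + b]
    blocks = X.reshape(X.shape[0] // h, h, X.shape[1] // w, w)
    if mode == 'max':
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

X1 = np.array([[10, 255, 125,   0, 170, 100],
               [70, 255, 105,  25,  25,  70],
               [255,  0, 150,   0,  10,  10],
               [0,  255,  10,  10, 150,  20],
               [70,  15, 200, 100,  95,   0],
               [35,  25, 100,  20,   0,  60]])

X1_noisy = X1.copy()
X1_noisy[0, 0] = 40        # small change to a non-maximal pixel (illustrative)

print(pool2d(X1))
print(np.array_equal(pool2d(X1), pool2d(X1_noisy)))
```

Changing a pixel that is not the maximum of its 2 x 2 neighborhood leaves the max-pooled output unchanged, which is the local invariance described above.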


Note 


Traditionally, pooling is assumed to be nonoverlapping. Pooling is typically 
performed on nonoverlapping neighborhoods, which can be done by setting the 
stride parameter equal to the pooling size. For example, a nonoverlapping 
pooling layer $P_{n_1 \times n_2}$ requires a stride parameter $s = (n_1, n_2)$. 
On the other hand, overlapping pooling occurs if the stride is smaller than the 
pooling size. An example where overlapping pooling is used in a convolutional 
network is described in ImageNet Classification with Deep Convolutional 
Neural Networks, A. Krizhevsky, I. Sutskever, and G. Hinton, 2012, which is 
freely available as a manuscript at https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks. 

Putting everything together to build a 
CNN 


So far, we've learned about the basic building blocks of convolutional neural 
networks. The concepts illustrated in this chapter are not really more difficult than 
traditional multilayer neural networks. Intuitively, we can say that the most 
important operation in a traditional neural network is the matrix-vector 
multiplication. 


For instance, we use matrix-vector multiplications to compute pre-activations (or net input) as 
in $a = Wx + b$. Here, x is a column vector representing pixels, and W is the 
weight matrix connecting the pixel inputs to each hidden unit. In a convolutional 
neural network, this operation is replaced by a convolution operation, as in 
$A = W * X + b$, where X is a matrix representing the pixels in a height x width 
arrangement. In both cases, the pre-activations are passed to an activation function to 
obtain the activation of a hidden unit $H = \phi(A)$, where $\phi$ is the activation 
function. Furthermore, recall that subsampling is another building block of a 
convolutional neural network, which may appear in the form of pooling, as we 
described in the previous section. 
Working with multiple input or color channels 


An input sample to a convolutional layer may contain one or more 2D arrays or 
matrices with dimensions $N_1 \times N_2$ (for example, the image height and width in 
pixels). These $N_1 \times N_2$ matrices are called channels. Therefore, using multiple 
channels as input to a convolutional layer requires us to use a rank-3 tensor or a 
three-dimensional array: $X_{N_1 \times N_2 \times C_{in}}$, where $C_{in}$ is the number of input channels. 
For example, let's consider images as input to the first layer of a CNN. If the image 
is colored and uses the RGB color mode, then $C_{in} = 3$ (for the red, green, and blue 
color channels in RGB). However, if the image is in grayscale, then we have $C_{in} = 1$ 
because there is only one channel with the grayscale pixel intensity values. 
Tip 


When we work with images, we can read images into NumPy arrays using the 
'uint8' (unsigned 8-bit integer) data type to reduce memory usage compared to 16-bit, 
32-bit, or 64-bit integer types, for example. Unsigned 8-bit integers take values 
in the range [0, 255], which are sufficient to store the pixel information in RGB 
images, which also take values in the same range. 
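The memory savings are easy to quantify with NumPy's `nbytes` attribute. The image shape below is a hypothetical 100 x 100 RGB image chosen only for illustration.

```python
import numpy as np

shape = (100, 100, 3)      # hypothetical image: height x width x channels

nbytes = {}
for dtype in (np.uint8, np.uint16, np.int32, np.int64):
    name = np.dtype(dtype).name
    nbytes[name] = np.zeros(shape, dtype=dtype).nbytes
    print(name, nbytes[name], 'bytes')

info = np.iinfo(np.uint8)
print('uint8 range:', info.min, info.max)
```

An array of 64-bit integers needs eight times the memory of the same array stored as uint8.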


Next, let's look at an example of how we can read in an image into our Python 
session using SciPy. However, please note that reading images with SciPy requires 
that you have the Python Imaging Library (PIL) package installed. We can install 
Pillow (https://python-pillow.org), a more user-friendly fork of PIL, to satisfy those 
requirements, as follows: 


pip install pillow 


Once Pillow is installed, we can use the imread function from the scipy.misc 
module to read an RGB image (this example image is located in the code bundle 
folder that is provided with this chapter at https://github.com/rasbt/python-machine-learning-book-2nd-edition/tree/master/code/ch15): 


>>> import scipy.misc 
>>> img = scipy.misc.imread('./example-image.png', 
...                         mode='RGB') 
>>> print('Image shape:', img.shape) 
Image shape: (252, 221, 3) 
>>> print('Number of channels:', img.shape[2]) 
Number of channels: 3 
>>> print('Image data type:', img.dtype) 
Image data type: uint8 
>>> print(img[100:102, 100:102, :]) 
[[[179 134 110] 
  [182 136 112]] 

 [[180 135 111] 
  [182 137 113]]] 


Now that we have familiarized ourselves with the structure of input data, the next 
question is how can we incorporate multiple input channels in the convolution 
operation that we discussed in the previous sections? 


The answer is very simple: we perform the convolution operation for each channel 
separately and then add the results together using the matrix summation. The 
convolution associated with each channel (c) has its own kernel matrix as 
$W[:, :, c]$. The total pre-activation result is computed in the following formula: 

Given a sample $X_{n_1 \times n_2 \times C_{in}}$, a kernel matrix $W_{m_1 \times m_2 \times C_{in}}$, and a bias value $b$: 

$$Y^{Conv} = \sum_{c=1}^{C_{in}} W[:, :, c] * X[:, :, c]$$

$$\text{pre-activation: } A = Y^{Conv} + b$$

$$\text{Feature map: } H = \phi(A)$$
The final result, H, is called a feature map. Usually, a convolutional layer of a CNN 
has more than one feature map. If we use multiple feature maps, the kernel tensor 
becomes four-dimensional: $width \times height \times C_{in} \times C_{out}$. Here, width x height is 
the kernel size, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of 
output feature maps. So, now let's include the number of output feature maps in the 
preceding formula and update it as follows: 

Given a sample $X_{n_1 \times n_2 \times C_{in}}$, a kernel matrix $W_{m_1 \times m_2 \times C_{in} \times C_{out}}$, and a bias vector $b_{C_{out}}$: 

$$Y^{Conv}[:, :, k] = \sum_{c=1}^{C_{in}} W[:, :, c, k] * X[:, :, c]$$

$$A[:, :, k] = Y^{Conv}[:, :, k] + b[k]$$

$$H[:, :, k] = \phi(A[:, :, k])$$
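The per-channel sum can be written out as a naive forward pass. This sketch makes several assumptions that are not from the text: the helper name `conv_layer_forward`, 'same' padding with a stride of 1, ReLU as the activation, and random example values. It is meant to mirror the formulas, not to be efficient.

```python
import numpy as np
import scipy.signal

def conv_layer_forward(X, W, b):
    """X: (n1, n2, C_in), W: (m1, m2, C_in, C_out), b: (C_out,)."""
    n1, n2, c_in = X.shape
    c_out = W.shape[3]
    A = np.zeros((n1, n2, c_out))
    for k in range(c_out):                       # one feature map per kernel
        for c in range(c_in):                    # sum over input channels
            A[:, :, k] += scipy.signal.convolve2d(
                X[:, :, c], W[:, :, c, k], mode='same')
        A[:, :, k] += b[k]                       # add the bias of feature map k
    return np.maximum(A, 0.0)                    # ReLU as an example activation

rng = np.random.RandomState(0)
X = rng.rand(6, 6, 3)          # hypothetical 6 x 6 input with 3 channels
W = rng.randn(3, 3, 3, 5)      # 3 x 3 kernels, 3 input channels, 5 feature maps
b = np.zeros(5)
H = conv_layer_forward(X, W, b)
print(H.shape)
```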


To conclude our discussion of computing convolutions in the context of neural 
networks, let's look at the example in the following figure that shows a convolutional 
layer, followed by a pooling layer. 


In this example, there are three input channels. The kernel tensor is four-dimensional. 
Each kernel matrix is denoted as $m_1 \times m_2$, and there are three of them, 
one for each input channel. Furthermore, there are five such kernels, accounting for 
five output feature maps. Finally, there is a pooling layer for subsampling the feature 
maps, as shown in the following figure: 

[Figure: a convolution layer with $C_{in} = 3$ input channels and a sum over input channels for each output feature map, followed by a pooling layer]
How many trainable parameters exist in the preceding example? 

Tip 


To illustrate the advantages of convolution, parameter-sharing and sparse-connectivity, 
let's work through an example. The convolutional layer in the network 
shown in the preceding figure is a four-dimensional tensor. So, there are 
$m_1 \times m_2 \times 3 \times 5$ parameters associated with the kernel. Furthermore, there is 
a bias vector for each output feature map of the convolutional layer. Thus, the size of 
the bias vector is 5. Pooling layers do not have any (trainable) parameters; therefore, 
we can write the following: 

$$m_1 \times m_2 \times 3 \times 5 + 5$$

If the input tensor is of size $n_1 \times n_2 \times 3$, assuming that the convolution is performed 
with mode='same', then the output feature maps would be of size $n_1 \times n_2 \times 5$. 

Note that this number is much smaller than the case if we wanted to have a fully 
connected layer instead of the convolution layer. In the case of a fully connected 
layer, the number of parameters for the weight matrix to reach the same number of 
output units would have been as follows: 

$$(n_1 \times n_2 \times 3) \times (n_1 \times n_2 \times 5) = (n_1 \times n_2)^2 \times 3 \times 5$$

Given that $m_1 < n_1$ and $m_2 < n_2$, we can see that the difference in the number of 
trainable parameters is huge. 
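To put numbers on this comparison, the two counts can be evaluated for concrete sizes. The sizes used below (28 x 28 inputs and 5 x 5 kernels) and the function names are assumptions chosen for illustration only.

```python
def conv_params(m1, m2, c_in, c_out):
    """Kernel weights plus one bias per output feature map."""
    return m1 * m2 * c_in * c_out + c_out

def fc_weights(n1, n2, c_in, c_out):
    """Weight-matrix size of a fully connected layer with the same in/out units."""
    return (n1 * n2 * c_in) * (n1 * n2 * c_out)

print(conv_params(5, 5, 3, 5))      # 380
print(fc_weights(28, 28, 3, 5))     # 9219840
```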


In the next section, we will talk about how to regularize a neural network. 
Regularizing a neural network with dropout 


Choosing the size of a network, whether we are dealing with a traditional (fully 
connected) neural network or a CNN, has always been a challenging problem. For 
instance, the size of a weight matrix and the number of layers need to be tuned to 
achieve a reasonably good performance. 


The capacity of a network refers to the level of complexity of the function that it can 
learn. Small networks, networks with a relatively small number of parameters, have 
a low capacity and are therefore likely to underfit, resulting in poor performance 
since they cannot learn the underlying structure of complex datasets. 


Yet, very large networks may more easily result in overfitting, where the network 
will memorize the training data and do extremely well on the training set while 
achieving poor performance on the held-out test set. When we deal with real-world 
machine learning problems, we do not know how large the network should be a 
priori. 


One way to address this problem is to build a network with a relatively large 
capacity (in practice, we want to choose a capacity that is slightly larger than 
necessary) to do well on the training set. Then, to prevent overfitting, we can apply 
one or multiple regularization schemes to achieve good generalization performance 
on new data, such as the held-out test set. A popular choice for regularization is L2 
regularization, which we discussed previously in this book. 


In recent years, another popular regularization technique called dropout has 
emerged that works amazingly well for regularizing (deep) neural networks 
(Dropout: a simple way to prevent neural networks from overfitting, Nitish 
Srivastava and others, Journal of Machine Learning Research 15.1, pages 1929-1958, 
2014, http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf). 


Intuitively, dropout can be considered as the consensus (averaging) of an ensemble 
of models. In ensemble learning, we train several models independently. During 
prediction, we then use the consensus of all the trained models. However, both 
training several models and collecting and averaging the output of multiple models is 
computationally expensive. Here, dropout offers a workaround with an efficient way 
to train many models at once and compute their average predictions at test or 
prediction time. 
Dropout is usually applied to the hidden units of higher layers. During the training 
phase of a neural network, a fraction of the hidden units is randomly dropped at 
every iteration with probability $p_{drop}$ (or the keep probability $p_{keep} = 1 - p_{drop}$). 
This dropout probability is determined by the user and the common choice is 
p = 0.5, as discussed in the previously mentioned article by Nitish Srivastava and 
others, 2014. When dropping a certain fraction of input neurons, the weights 
associated with the remaining neurons are rescaled to account for the missing 
(dropped) neurons. 


The effect of this random dropout forces the network to learn a redundant 
representation of the data. Therefore, the network cannot rely on an activation of any 
set of hidden units since they may be turned off at any time during training, and it is 
forced to learn more general and robust patterns from the data. 


This random dropout can effectively prevent overfitting. The following figure shows 
an example of applying dropout with probability p = 0.5 during the training phase, 
whereby half of the neurons become inactive randomly. However, during prediction, 
all neurons will contribute to computing the pre-activations of the next layer. 

[Figure: Training: dropout probability p = 50%; Evaluation: use all units]
As shown here, one important point to remember is that units may drop randomly 
during training only, while for the evaluation phase, all the hidden units must be 
active (for instance, $p_{drop} = 0$ or $p_{keep} = 1$). To ensure that the overall activations 
are on the same scale during training and prediction, the activations of the active 
neurons have to be scaled appropriately (for example, by halving the activation if the 
dropout probability was set to p = 0.5). 


However, since it is inconvenient to always scale activations when we make 
predictions in practice, TensorFlow and other tools scale the activations during 
training (for example, by doubling the activations if the dropout probability was set 
to p = 0.5). 


So, what is the relationship between dropout and ensemble learning? Since we drop 
different hidden neurons at each iteration, effectively we are training different 
models. When all these models are finally trained, we set the keep probability to 1 
and use all the hidden units. This means we are taking the average activation from all 
the hidden units. 
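The training-time scaling can be sketched in plain NumPy. This is an illustration of the idea only, not TensorFlow's implementation; the function names `dropout_train` and `dropout_eval` are hypothetical.

```python
import numpy as np

def dropout_train(h, p_drop, rng):
    """Randomly drop activations and rescale the kept ones during training."""
    p_keep = 1.0 - p_drop
    mask = rng.binomial(1, p_keep, size=h.shape)   # 1 = keep, 0 = drop
    return h * mask / p_keep                       # doubles kept units if p_drop = 0.5

def dropout_eval(h):
    """At evaluation time all units stay active and no scaling is needed."""
    return h

rng = np.random.RandomState(1)
h = np.ones((1000, 100))                 # hypothetical activations
out = dropout_train(h, p_drop=0.5, rng=rng)
print(np.unique(out))                    # kept units are scaled to 2.0, dropped ones are 0.0
print(out.mean())                        # close to the original mean of 1.0
```

Because the kept activations are scaled by 1 / p_keep during training, their expected value matches the unscaled activations used at prediction time.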
Implementing a deep convolutional neural network using TensorFlow 


In Chapter 13, Parallelizing Neural Network Training with TensorFlow, you may 
recall that we implemented a multilayer neural network for handwritten digit 
recognition problems, using different API levels of TensorFlow. You may also recall 
that we achieved about 97 percent accuracy. 


So now, we want to implement a CNN to solve this same problem and see its 
predictive power in classifying handwritten digits. Note that the fully connected 
layers that we saw in Chapter 13, Parallelizing Neural Network Training with 
TensorFlow, were able to perform well on this problem. However, in some 
applications, such as reading bank account numbers from handwritten digits, even 
tiny mistakes can be very costly. Therefore, it is crucial to reduce this error as much 
as possible. 
The multilayer CNN architecture 


The architecture of the network that we are going to implement is shown in the 
following figure. The input is 28 x 28 grayscale images. Considering the number of 
channels (which is 1 for grayscale images) and a batch of input images, the input 
tensor's dimensions will be batchsize x 28 x 28 x 1. 


The input data goes through two convolutional layers that have a kernel size of 5 x 5. 
The first convolution has 32 output feature maps, and the second one has 64 output 
feature maps. Each convolution layer is followed by a subsampling layer in the form 
of a max-pooling operation. 


Then a fully connected layer passes the output to a second fully connected layer, 
which acts as the final softmax output layer. The architecture of the network that we 
are going to implement is shown in the following figure: 

[Figure: 28x28x1 input, Conv 5x5x32 giving 24x24x32, Pooling 2x2 giving 12x12x32, Conv 5x5x64 giving 8x8x64, Pooling 2x2 giving 4x4x64]





The dimensions of the tensors in each layer are as follows: 

• Input: [batchsize x 28 x 28 x 1] 
• Conv_1: [batchsize x 24 x 24 x 32] 
• Pooling_1: [batchsize x 12 x 12 x 32] 
• Conv_2: [batchsize x 8 x 8 x 64] 
• Pooling_2: [batchsize x 4 x 4 x 64] 
• FC_1: [batchsize x 1024] 
• FC_2 and softmax layer: [batchsize x 10] 
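These dimensions follow from the output-size formula introduced earlier. The sketch below assumes no padding for the convolutions and nonoverlapping 2 x 2 pooling (a stride of 2), which is what the listed sizes imply; the helper name `out_size` is hypothetical.

```python
def out_size(n, m, p=0, s=1):
    """Output size o = floor((n + 2p - m) / s) + 1 along one spatial dimension."""
    return (n + 2 * p - m) // s + 1

conv_1 = out_size(28, 5)              # 5 x 5 kernel, no padding
pool_1 = out_size(conv_1, 2, s=2)     # 2 x 2 max-pooling, stride 2
conv_2 = out_size(pool_1, 5)
pool_2 = out_size(conv_2, 2, s=2)
flattened = pool_2 * pool_2 * 64      # units obtained by flattening Pooling_2

print(conv_1, pool_1, conv_2, pool_2, flattened)   # 24 12 8 4 1024
```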


We'll implement this network using two APIs: the low-level TensorFlow API and 
the TensorFlow Layers API. But first, let's define some helper functions at the 
beginning of the next section. 
Loading and preprocessing the data 


If you'll recall again from Chapter 13, Parallelizing Neural Network Training with 
TensorFlow, we used a function called load_mnist to read the MNIST handwritten 
digit dataset. Now we need to repeat the same procedure here as well, as follows: 


>>> #### Loading the data 
>>> X_data, y_data = load_mnist('./mnist/', kind='train') 
>>> print('Rows: {},  Columns: {}'.format( 
...           X_data.shape[0], X_data.shape[1])) 
>>> X_test, y_test = load_mnist('./mnist/', kind='t10k') 
>>> print('Rows: {},  Columns: {}'.format( 
...           X_test.shape[0], X_test.shape[1])) 
>>> 
>>> X_train, y_train = X_data[:50000,:], y_data[:50000] 
>>> X_valid, y_valid = X_data[50000:,:], y_data[50000:] 
>>> 
>>> print('Training:   ', X_train.shape, y_train.shape) 
>>> print('Validation: ', X_valid.shape, y_valid.shape) 
>>> print('Test Set:   ', X_test.shape, y_test.shape) 


We are splitting the data into training, validation, and test sets. The following 
result shows the shape of each set: 


Rows: 60000,  Columns: 784 
Rows: 10000,  Columns: 784 
Training:    (50000, 784) (50000,) 
Validation:  (10000, 784) (10000,) 
Test Set:    (10000, 784) (10000,) 

After we've loaded the data, we need a function for iterating through mini-batches of 
data, as follows: 


>>> def batch_generator(X, y, batch_size=64, 
...                     shuffle=False, random_seed=None): 
... 
...     idx = np.arange(y.shape[0]) 
... 
...     if shuffle: 
...         rng = np.random.RandomState(random_seed) 
...         rng.shuffle(idx) 
...         X = X[idx] 
...         y = y[idx] 
... 
...     for i in range(0, X.shape[0], batch_size): 
...         yield (X[i:i+batch_size, :], y[i:i+batch_size]) 
This function returns a generator with a tuple for a batch of samples, for instance, 
data X and labels y. We then need to normalize the data (mean centering and division 
by the standard deviation) for better training performance and convergence. 


We compute the mean of each feature using the training data (X_train) and calculate 
the standard deviation across all features. The reason why we don't compute the 
standard deviation for each feature individually is that some features (pixel 
positions) in image datasets such as MNIST have a constant value of 255 across all 
images corresponding to white pixels in a grayscale image. 


A constant value across all samples indicates no variation, and therefore, the 
standard deviation of those features will be zero, and as a result would yield a 
division-by-zero error, which is why we compute the standard deviation from the 
X_train array using np.std without specifying an axis argument: 


>>> mean_vals = np.mean(X_train, axis=0) 
>>> std_val = np.std(X_train) 


>>> X_train_centered = (X_train - mean_vals)/std_val 
>>> X_valid_centered = (X_valid - mean_vals)/std_val 
>>> X_test_centered = (X_test - mean_vals)/std_val 

Now we are ready to implement the CNN we just described. We will proceed by 
implementing the CNN model in TensorFlow. 
Implementing a CNN in the TensorFlow low-level API 


For implementing a CNN in TensorFlow, first we define two wrapper functions to 
make the process of building the network simpler: a wrapper function for a 
convolutional layer and a function for building a fully connected layer. 


The first function for a convolution layer is as follows: 


import tensorflow as tf 
import numpy as np 


def conv_layer(input_tensor, name, 
               kernel_size, n_output_channels, 
               padding_mode='SAME', strides=(1, 1, 1, 1)): 
    with tf.variable_scope(name): 
        ## get n_input_channels: 
        ##   input tensor shape: 
        ##   [batch x width x height x channels_in] 
        input_shape = input_tensor.get_shape().as_list() 
        n_input_channels = input_shape[-1] 

        weights_shape = list(kernel_size) + \ 
                        [n_input_channels, n_output_channels] 

        weights = tf.get_variable(name='_weights', 
                                  shape=weights_shape) 
        print(weights) 

        biases = tf.get_variable(name='_biases', 
                                 initializer=tf.zeros( 
                                     shape=[n_output_channels])) 
        print(biases) 

        conv = tf.nn.conv2d(input=input_tensor, 
                            filter=weights, 
                            strides=strides, 
                            padding=padding_mode) 
        print(conv) 

        conv = tf.nn.bias_add(conv, biases, 
                              name='net_pre-activation') 
        print(conv) 

        conv = tf.nn.relu(conv, name='activation') 
        print(conv) 

        return conv 
This wrapper function will do all the necessary work for building a convolutional 
layer, including defining the weights and biases, initializing them, and performing the convolution 
operation using the tf.nn.conv2d function. There are four required arguments: 


• input_tensor: The tensor given as input to the convolutional layer 
• name: The name of the layer, which is used as the scope name 
• kernel_size: The dimensions of the kernel tensor provided as a tuple or list 
• n_output_channels: The number of output feature maps 


Notice that the weights are initialized using the Xavier (or Glorot) initialization 
method by default when using tf.get_variable (we discussed the Xavier/Glorot 
initialization scheme in Chapter 14, Going Deeper: The Mechanics of TensorFlow), 
while the biases are initialized to zeros using the tf.zeros function. The net 
pre-activations are passed to the ReLU activation function. We can print the operations 
and TensorFlow graph nodes to see the shape and type of tensors. Let's test this 
function with a simple input by defining a placeholder, as follows: 


>>> g = tf.Graph()
>>> with g.as_default():
...     x = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
...     conv_layer(x, name='convtest',
...                kernel_size=(3, 3),
...                n_output_channels=32)
>>>
>>> del g, x
<tf.Variable 'convtest/_weights:0' shape=(3, 3, 1, 32) dtype=float32_ref>
<tf.Variable 'convtest/_biases:0' shape=(32,) dtype=float32_ref>
Tensor("convtest/Conv2D:0", shape=(?, 28, 28, 32), dtype=float32)
Tensor("convtest/net_pre-activation:0", shape=(?, 28, 28, 32), dtype=float32)
Tensor("convtest/activation:0", shape=(?, 28, 28, 32), dtype=float32)
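To make the default weight initialization more concrete, the following sketch computes the bound of the uniform range used by the Glorot (Xavier) uniform scheme for a given weight shape. It is a standard-library illustration only, not TensorFlow's own code; the helper name glorot_uniform_limit is hypothetical, and the fan-in/fan-out convention for convolution kernels (receptive field size times the number of channels) is stated here as an assumption about how the scheme is usually applied.

```python
import math


def glorot_uniform_limit(shape):
    """Return the bound r of the uniform range [-r, r] for a weight shape.

    For a convolution kernel of shape
    [height, width, channels_in, channels_out], the receptive field
    size (height * width) multiplies both fan-in and fan-out.
    """
    receptive_field = 1
    for dim in shape[:-2]:
        receptive_field *= dim
    fan_in = shape[-2] * receptive_field
    fan_out = shape[-1] * receptive_field
    return math.sqrt(6.0 / (fan_in + fan_out))


# Shape of the weights created in the test above: 3x3 kernel, 1 -> 32 channels
conv_limit = glorot_uniform_limit([3, 3, 1, 32])
# A fully connected layer with 784 inputs and 32 outputs
fc_limit = glorot_uniform_limit([784, 32])
print(conv_limit, fc_limit)
```

Larger layers receive a narrower initialization range, which keeps the scale of the activations roughly comparable from layer to layer.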


The next wrapper function is for defining our fully connected layers: 


def fc_layer(input_tensor, name,
             n_output_units, activation_fn=None):
    with tf.variable_scope(name):
        input_shape = input_tensor.get_shape().as_list()[1:]
        n_input_units = np.prod(input_shape)
        if len(input_shape) > 1:
            input_tensor = tf.reshape(input_tensor,
                                      shape=(-1, n_input_units))

        weights_shape = [n_input_units, n_output_units]
        weights = tf.get_variable(name='_weights',
                                  shape=weights_shape)
        print(weights)
        biases = tf.get_variable(name='_biases',
                                 initializer=tf.zeros(
                                     shape=[n_output_units]))
        print(biases)

        layer = tf.matmul(input_tensor, weights)
        print(layer)
        layer = tf.nn.bias_add(layer, biases,
                               name='net_pre-activation')
        print(layer)
        if activation_fn is None:
            return layer

        layer = activation_fn(layer, name='activation')
        print(layer)
        return layer


The wrapper function fc_layer also builds the weights and biases, initializes them 
similar to the conv_layer function, and then performs a matrix multiplication using 
the tf.matmul function. The fc_layer function has three required arguments: 


• input_tensor: The input tensor 
• name: The name of the layer, which is used as the scope name 
• n_output_units: The number of output units 


We can test this function for a simple input tensor as follows: 


>>> g = tf.Graph()
>>> with g.as_default():
...     x = tf.placeholder(tf.float32,
...                        shape=[None, 28, 28, 1])
...     fc_layer(x, name='fctest', n_output_units=32,
...              activation_fn=tf.nn.relu)
>>>
>>> del g, x
<tf.Variable 'fctest/_weights:0' shape=(784, 32) dtype=float32_ref>
<tf.Variable 'fctest/_biases:0' shape=(32,) dtype=float32_ref>
Tensor("fctest/MatMul:0", shape=(?, 32), dtype=float32)
Tensor("fctest/net_pre-activation:0", shape=(?, 32), dtype=float32)
Tensor("fctest/activation:0", shape=(?, 32), dtype=float32)


The behavior of this function is a bit different for the two fully connected layers in 
our model. The first fully connected layer gets its input right after a convolutional 
(and pooling) layer; therefore, the input is still a 4D tensor, and fc_layer flattens 
it using the tf.reshape function. The second fully connected layer receives an input 
that is already flat, so no reshaping is needed. Furthermore, the net pre-activations 
from the first FC layer are passed to the ReLU activation function, but the second one 
corresponds to the logits, and therefore, a linear activation must be used. 
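The flattening step can be illustrated without TensorFlow. The sketch below is a pure-Python stand-in: n_flat_units mirrors the np.prod call on the non-batch dimensions, and flatten_batch plays the role of the reshape to (-1, n_input_units). Both helper names and the toy batch are hypothetical.

```python
from math import prod  # available in Python 3.8 and later


def n_flat_units(shape):
    """Units per sample for a tensor shape [batch, d1, d2, ...]."""
    return prod(shape[1:])


def flatten_batch(batch):
    """Flatten each sample of a nested-list batch [batch, h, w, c] to 1D."""
    return [[value
             for row in sample
             for pixel in row
             for value in pixel]
            for sample in batch]


# The test input above: [None, 28, 28, 1] gives 784 input units,
# which matches the (784, 32) weight shape printed for 'fctest'.
units_test_input = n_flat_units([None, 28, 28, 1])

# A toy batch of two 2x2 single-channel "images"
toy_batch = [[[[1], [2]], [[3], [4]]],
             [[[5], [6]], [[7], [8]]]]
flat = flatten_batch(toy_batch)
```

The batch dimension is left untouched, so each sample becomes one row of the matrix that is multiplied by the weights.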


Now we can utilize these wrapper functions to build the whole convolutional 
network. We define a function called build_cnn to handle the building of the CNN 
model, as shown in the following code: 


def build_cnn():
    ## Placeholders for X and y:
    tf_x = tf.placeholder(tf.float32, shape=[None, 784],
                          name='tf_x')
    tf_y = tf.placeholder(tf.int32, shape=[None],
                          name='tf_y')

    # reshape x to a 4D tensor:
    # [batchsize, width, height, 1]
    tf_x_image = tf.reshape(tf_x, shape=[-1, 28, 28, 1],
                            name='tf_x_reshaped')
    ## One-hot encoding:
    tf_y_onehot = tf.one_hot(indices=tf_y, depth=10,
                             dtype=tf.float32,
                             name='tf_y_onehot')

    ## 1st layer: Conv_1
    print('\nBuilding 1st layer:')
    h1 = conv_layer(tf_x_image, name='conv_1',
                    kernel_size=(5, 5),
                    padding_mode='VALID',
                    n_output_channels=32)
    ## MaxPooling
    h1_pool = tf.nn.max_pool(h1,
                             ksize=[1, 2, 2, 1],
                             strides=[1, 2, 2, 1],
                             padding='SAME')
    ## 2nd layer: Conv_2
    print('\nBuilding 2nd layer:')
    h2 = conv_layer(h1_pool, name='conv_2',
                    kernel_size=(5, 5),
                    padding_mode='VALID',
                    n_output_channels=64)
    ## MaxPooling
    h2_pool = tf.nn.max_pool(h2,
                             ksize=[1, 2, 2, 1],
                             strides=[1, 2, 2, 1],
                             padding='SAME')

    ## 3rd layer: Fully Connected
    print('\nBuilding 3rd layer:')
    h3 = fc_layer(h2_pool, name='fc_3',
                  n_output_units=1024,
                  activation_fn=tf.nn.relu)

    ## Dropout
    keep_prob = tf.placeholder(tf.float32, name='fc_keep_prob')
    h3_drop = tf.nn.dropout(h3, keep_prob=keep_prob,
                            name='dropout_layer')

    ## 4th layer: Fully Connected (linear activation)
    print('\nBuilding 4th layer:')
    h4 = fc_layer(h3_drop, name='fc_4',
                  n_output_units=10,
                  activation_fn=None)

    ## Prediction
    predictions = {
        'probabilities': tf.nn.softmax(h4, name='probabilities'),
        'labels': tf.cast(tf.argmax(h4, axis=1), tf.int32,
                          name='labels')
    }

    ## Visualize the graph with TensorBoard:

    ## Loss Function and Optimization
    cross_entropy_loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            logits=h4, labels=tf_y_onehot),
        name='cross_entropy_loss')

    ## Optimizer:
    optimizer = tf.train.AdamOptimizer(learning_rate)
    optimizer = optimizer.minimize(cross_entropy_loss,
                                   name='train_op')

    ## Computing the prediction accuracy
    correct_predictions = tf.equal(
        predictions['labels'],
        tf_y, name='correct_preds')

    accuracy = tf.reduce_mean(
        tf.cast(correct_predictions, tf.float32),
        name='accuracy')
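To see how the spatial size of the feature maps changes through the layers defined in build_cnn, the following sketch traces it by hand. It relies on the usual TensorFlow size rules for 'SAME' and 'VALID' padding; the helper name out_size is hypothetical and introduced only for this illustration.

```python
import math


def out_size(n, window, stride, padding):
    """Spatial output size of a convolution or pooling step."""
    if padding == 'SAME':
        return math.ceil(n / stride)
    if padding == 'VALID':
        return math.ceil((n - window + 1) / stride)
    raise ValueError('unknown padding mode: %s' % padding)


size = 28                               # input images are 28 x 28
size = out_size(size, 5, 1, 'VALID')    # conv_1, 5x5 kernel: 24
size = out_size(size, 2, 2, 'SAME')     # max pooling 2x2, stride 2: 12
size = out_size(size, 5, 1, 'VALID')    # conv_2, 5x5 kernel: 8
size = out_size(size, 2, 2, 'SAME')     # max pooling 2x2, stride 2: 4

# conv_2 produces 64 feature maps, so the first fully connected layer
# receives 4 * 4 * 64 = 1024 values per sample after flattening.
n_flat = size * size * 64
print(size, n_flat)
```

With these settings, the flattened input to the first fully connected layer happens to have the same size as its number of output units (1024).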
In order to get reproducible results, we need to use a random seed for both NumPy and 
TensorFlow. Setting the TensorFlow random seed can be done at the graph level by 
placing the tf.set_random_seed function within the graph scope, which we will see 
later. The following figure shows the TensorFlow graph related to our multilayer 
CNN as visualized by TensorBoard: 

Graph of the multilayer CNN 

Note that in this implementation, we used the tf.train.AdamOptimizer function for 
training the CNN model. The Adam optimizer is a robust gradient-based 
optimization method suited for nonconvex optimization and machine learning 
problems. Two popular optimization methods inspired Adam: RMSProp and AdaGrad. 


The key advantage of Adam is in the choice of update step size derived from the 
running average of gradient moments. Please feel free to read more about the Adam 
optimizer in the manuscript, Adam: A Method for Stochastic Optimization, Diederik 
P. Kingma and Jimmy Lei Ba, 2014. The article is freely available at 
https://arxiv.org/abs/1412.6980. 
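The role of the running averages of the gradient moments can be sketched for a single scalar parameter. The code below follows the update rule described in the Adam paper cited above; it is a toy illustration, not the tf.train.AdamOptimizer implementation, and the function name adam_minimize, the test function, and the hyperparameter values are assumptions chosen for this example.

```python
import math


def adam_minimize(grad_fn, x0, learning_rate=0.1, beta1=0.9,
                  beta2=0.999, eps=1e-8, n_steps=200):
    """Minimize a scalar function with Adam; returns the visited points."""
    x, m, v = x0, 0.0, 0.0
    path = [x]
    for t in range(1, n_steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
        path.append(x)
    return path


# Toy problem: f(x) = x**2 with gradient 2x, starting from x = 1.0
path = adam_minimize(lambda x: 2.0 * x, x0=1.0)
print(path[1], path[-1])
```

Because the step is the ratio of the two running averages, the very first update has a magnitude of roughly the learning rate regardless of how large the gradient is.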


Furthermore, we will define four other functions: save and load for saving and 
loading checkpoints of the trained model, train for training the model using the 
training set, and predict to get prediction probabilities or prediction labels of the 
test data. The code for these functions is as follows: 


import os


def save(saver, sess, epoch, path='./model/'):
    if not os.path.isdir(path):
        os.makedirs(path)
    print('Saving model in %s' % path)
    saver.save(sess, os.path.join(path, 'cnn-model.ckpt'),
               global_step=epoch)


def load(saver, sess, path, epoch):
    print('Loading model from %s' % path)
    saver.restore(sess, os.path.join(
        path, 'cnn-model.ckpt-%d' % epoch))


def train(sess, training_set, validation_set=None,
          initialize=True, epochs=20, shuffle=True,
          dropout=0.5, random_seed=None):

    X_data = np.array(training_set[0])
    y_data = np.array(training_set[1])
    training_loss = []

    ## initialize variables
    if initialize:
        sess.run(tf.global_variables_initializer())

    np.random.seed(random_seed)  # for shuffling in batch_generator
    for epoch in range(1, epochs+1):
        batch_gen = batch_generator(
            X_data, y_data,
            shuffle=shuffle)
        avg_loss = 0.0
        for i, (batch_x, batch_y) in enumerate(batch_gen):
            feed = {'tf_x:0': batch_x,
                    'tf_y:0': batch_y,
                    'fc_keep_prob:0': dropout}
            loss, _ = sess.run(
                ['cross_entropy_loss:0', 'train_op'],
                feed_dict=feed)
            avg_loss += loss

        training_loss.append(avg_loss / (i+1))
        print('Epoch %02d Training Avg. Loss: %7.3f' % (
            epoch, avg_loss), end=' ')
        if validation_set is not None:
            feed = {'tf_x:0': validation_set[0],
                    'tf_y:0': validation_set[1],
                    'fc_keep_prob:0': 1.0}
            valid_acc = sess.run('accuracy:0', feed_dict=feed)
            print(' Validation Acc: %7.3f' % valid_acc)
        else:
            print()


def predict(sess, X_test, return_proba=False):
    feed = {'tf_x:0': X_test,
            'fc_keep_prob:0': 1.0}
    if return_proba:
        return sess.run('probabilities:0', feed_dict=feed)
    else:
        return sess.run('labels:0', feed_dict=feed)
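The train function relies on a batch_generator helper that is not defined in this section. The following pure-Python sketch shows one plausible behavior for such a helper: it optionally shuffles the sample indices and then yields consecutive mini-batches. The signature, the default batch size, and the use of the random module are assumptions made for illustration; the helper actually used with train may differ (for example, it may work on NumPy arrays and rely on NumPy's random state).

```python
import random


def batch_generator(X, y, batch_size=64, shuffle=False, random_seed=None):
    """Yield (X_batch, y_batch) pairs that together cover the data once."""
    idx = list(range(len(y)))
    if shuffle:
        random.Random(random_seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        batch_idx = idx[start:start + batch_size]
        yield [X[i] for i in batch_idx], [y[i] for i in batch_idx]


# Toy data: 10 samples with one feature each; the label equals the feature.
X_toy = [[float(i)] for i in range(10)]
y_toy = list(range(10))

batches = list(batch_generator(X_toy, y_toy, batch_size=4,
                               shuffle=True, random_seed=123))
batch_sizes = [len(yb) for _, yb in batches]
```

Each epoch in train creates a fresh generator, so every sample is visited once per epoch, and the last batch may be smaller than the others.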


Now we can create a TensorFlow graph object, set the graph-level random seed, and 
build the CNN model in that graph, as follows: 


>>> ## Define hyperparameters
>>> learning_rate = 1e-4
>>> random_seed = 123
>>>
>>> ## create a graph
>>> g = tf.Graph()
>>> with g.as_default():
...     tf.set_random_seed(random_seed)
...     ## build the graph
...     build_cnn()
...
...     ## saver:
...     saver = tf.train.Saver()


Note that in the preceding code, after we built the model by calling the build_cnn 
function, we created a saver object from the tf.train.Saver class for saving and 
restoring trained models, as we saw in Chapter 14, Going Deeper: The Mechanics 
of TensorFlow. 


The next step is to train our CNN model. For this, we need to create a TensorFlow 
session to launch the graph; then, we call the train function. To train the model for 
the first time, we have to initialize all the variables in the network. 


For this purpose, we have defined an argument named initialize that will take care 
of the initialization. When initialize=True, we will execute 
tf.global_variables_initializer through session.run. This initialization step 
should be avoided in case you want to train additional epochs; for example, you can 
restore an already trained model and train further for an additional 10 epochs. The code 
for training the model for the first time is as follows: 


>>> ## create a TF session
>>> ## and train the CNN model
>>>
>>> with tf.Session(graph=g) as sess:
...     train(sess,
...           training_set=(X_train_centered, y_train),
...           validation_set=(X_valid_centered, y_valid),
...           initialize=True,
...           random_seed=123)
...     save(saver, sess, epoch=20)


Epoch 01 Training Avg. Loss: 272.772 Validation Acc: 0.973
Epoch 02 Training Avg. Loss:  76.053 Validation Acc: [illegible]
Epoch 03 Training Avg. Loss:  52.309 Validation Acc: 0.984
Epoch 04 Training Avg. Loss:  39.740 Validation Acc: 0.966
Epoch 05 Training Avg. Loss:  21.500 Validation Acc: [illegible]
...
Epoch 19 Training Avg. Loss: [illegible] Validation Acc: [illegible]
Epoch 20 Training Avg. Loss: [illegible] Validation Acc: [illegible]


Saving model in ./model/ 


After the 20 epochs are finished, we save the trained model for future use so that we 
do not have to retrain the model every time, and therefore, save computational time. 
The following code shows how to restore a saved model. We delete the graph g, then 
create a new graph g2, and reload the trained model to do prediction on the test set: 


>>> ### Calculate prediction accuracy
>>> ### on test set
>>> ### restoring the saved model
>>>
>>> del g
>>>
>>> ## create a new graph
>>> ## and build the model
>>> g2 = tf.Graph()
>>> with g2.as_default():
...     tf.set_random_seed(random_seed)
...     ## build the graph
...     build_cnn()
...
...     ## saver:
...     saver = tf.train.Saver()
>>>
>>> ## create a new session
>>> ## and restore the model
>>> with tf.Session(graph=g2) as sess:
...     load(saver, sess,
...          epoch=20, path='./model/')
...
...     preds = predict(sess, X_test_centered,
...                     return_proba=False)
...
...     print('Test Accuracy: %.3f%%' % (100*
...           np.sum(preds == y_test)/len(y_test)))


Building 1st layer:
Building 2nd layer:
Building 3rd layer:
Building 4th layer:


Test Accuracy: 99.310%


The output contains several extra lines from the print statements in the build_cnn 
function, but they are not shown here for brevity. As you can see, the prediction 
accuracy on the test set is already better than what we achieved using the multilayer 
perceptron in Chapter 13, Parallelizing Neural Network Training with TensorFlow. 


Please make sure you use X_test_centered, which is the preprocessed version of 
the test data; you will get lower accuracy if you try using X_test instead. 

Now, let's look at the predicted labels as well as their probabilities on the first 10 test 
samples. We already have the predictions stored in preds; however, in order to have 
more practice in using the session and launching the graph, we repeat those steps 
here: 


>>> ## run the prediction on
>>> ## some test samples
>>> np.set_printoptions(precision=2, suppress=True)
>>>
>>> with tf.Session(graph=g2) as sess:
...     load(saver, sess,
...          epoch=20, path='./model/')
...
...     print(predict(sess, X_test_centered[:10],
...                   return_proba=False))
...
...     print(predict(sess, X_test_centered[:10],
...                   return_proba=True))


Loading model from ./model/
INFO:tensorflow:Restoring parameters from ./model/cnn-model.ckpt-20
[7 2 1 0 4 1 4 9 5 9]

[10 x 10 array of predicted class probabilities, one row per test sample; the values are illegible in the source]


Finally, let's see how we can train the model further to reach a total of 40 epochs. 
Since we have already trained 20 epochs from the initialized weights and biases, we 
can save time by restoring the already trained model and continuing training for 20 
additional epochs. This is very easy to do with our setup. We need to call the 
train function again, but this time, we set initialize=False to avoid the 
initialization step. The code is as follows: 


## continue training for 20 more epochs
## without re-initializing :: initialize=False
## create a new session
## and restore the model
with tf.Session(graph=g2) as sess:
    load(saver, sess,
         epoch=20, path='./model/')

    train(sess,
          training_set=(X_train_centered, y_train),
          validation_set=(X_valid_centered, y_valid),
          initialize=False,
          epochs=20,
          random_seed=123)

    save(saver, sess, epoch=40, path='./model/')

    preds = predict(sess, X_test_centered,
                    return_proba=False)

    print('Test Accuracy: %.3f%%' % (100*
          np.sum(preds == y_test)/len(y_test)))


The result shows that training for 20 additional epochs slightly improved the 
performance to get 99.37 percent prediction accuracy on the test set. 


In this section, we saw how to implement a multilayer convolutional neural network 
in the low-level TensorFlow API. In the next section, we'll implement the same 
network, but we'll use the TensorFlow Layers API. 

Implementing a CNN in the TensorFlow Layers API 


For the implementation in the TensorFlow Layers API, we need to repeat the same 
process of loading the data and preprocessing steps to get X_train_centered, 
X_valid_centered, and X_test_centered. Then, we can implement the model in a 
new class, as follows: 


import os
import tensorflow as tf
import numpy as np


class ConvNN(object):
    def __init__(self, batchsize=64,
                 epochs=20, learning_rate=1e-4,
                 dropout_rate=0.5,
                 shuffle=True, random_seed=None):
        np.random.seed(random_seed)
        self.batchsize = batchsize
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.dropout_rate = dropout_rate
        self.shuffle = shuffle

        g = tf.Graph()
        with g.as_default():
            ## set random-seed:
            tf.set_random_seed(random_seed)

            ## build the network:
            self.build()

            ## initializer
            self.init_op = \
                tf.global_variables_initializer()

            ## saver
            self.saver = tf.train.Saver()

        ## create a session
        self.sess = tf.Session(graph=g)

    def build(self):

        ## Placeholders for X and y:
        tf_x = tf.placeholder(tf.float32,
                              shape=[None, 784],
                              name='tf_x')
        tf_y = tf.placeholder(tf.int32,
                              shape=[None],
                              name='tf_y')
        is_train = tf.placeholder(tf.bool,
                                  shape=(),
                                  name='is_train')

        ## reshape x to a 4D tensor:
        ##  [batchsize, width, height, 1]
        tf_x_image = tf.reshape(tf_x, shape=[-1, 28, 28, 1],
                                name='input_x_2dimages')
        ## One-hot encoding:
        tf_y_onehot = tf.one_hot(indices=tf_y, depth=10,
                                 dtype=tf.float32,
                                 name='input_y_onehot')

        ## 1st layer: Conv_1
        h1 = tf.layers.conv2d(tf_x_image,
                              kernel_size=(5, 5),
                              filters=32,
                              activation=tf.nn.relu)
        ## MaxPooling
        h1_pool = tf.layers.max_pooling2d(h1,
                                          pool_size=(2, 2),
                                          strides=(2, 2))
        ## 2nd layer: Conv_2
        h2 = tf.layers.conv2d(h1_pool, kernel_size=(5, 5),
                              filters=64,
                              activation=tf.nn.relu)
        ## MaxPooling
        h2_pool = tf.layers.max_pooling2d(h2,
                                          pool_size=(2, 2),
                                          strides=(2, 2))

        ## 3rd layer: Fully Connected
        input_shape = h2_pool.get_shape().as_list()
        n_input_units = np.prod(input_shape[1:])
        h2_pool_flat = tf.reshape(h2_pool,
                                  shape=[-1, n_input_units])
        h3 = tf.layers.dense(h2_pool_flat, 1024,
                             activation=tf.nn.relu)

        ## Dropout
        h3_drop = tf.layers.dropout(h3,
                                    rate=self.dropout_rate,
                                    training=is_train)

        ## 4th layer: Fully Connected (linear activation)
        h4 = tf.layers.dense(h3_drop, 10,
                             activation=None)

        ## Prediction
        predictions = {
            'probabilities': tf.nn.softmax(h4,
                                           name='probabilities'),
            'labels': tf.cast(tf.argmax(h4, axis=1),
                              tf.int32, name='labels')}

        ## Loss Function and Optimization
        cross_entropy_loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits=h4, labels=tf_y_onehot),
            name='cross_entropy_loss')

        ## Optimizer:
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        optimizer = optimizer.minimize(cross_entropy_loss,
                                       name='train_op')

        ## Finding accuracy
        correct_predictions = tf.equal(
            predictions['labels'],
            tf_y, name='correct_preds')

        accuracy = tf.reduce_mean(
            tf.cast(correct_predictions, tf.float32),
            name='accuracy')

    def save(self, epoch, path='./tflayers-model/'):
        if not os.path.isdir(path):
            os.makedirs(path)
        print('Saving model in %s' % path)
        self.saver.save(self.sess,
                        os.path.join(path, 'model.ckpt'),
                        global_step=epoch)

    def load(self, epoch, path):
        print('Loading model from %s' % path)
        self.saver.restore(self.sess,
            os.path.join(path, 'model.ckpt-%d' % epoch))

    def train(self, training_set,
              validation_set=None,
              initialize=True):
        ## initialize variables
        if initialize:
            self.sess.run(self.init_op)

        self.train_cost_ = []
        X_data = np.array(training_set[0])
        y_data = np.array(training_set[1])

        for epoch in range(1, self.epochs+1):
            batch_gen = \
                batch_generator(X_data, y_data,
                                shuffle=self.shuffle)
            avg_loss = 0.0
            for i, (batch_x, batch_y) in \
                    enumerate(batch_gen):
                feed = {'tf_x:0': batch_x,
                        'tf_y:0': batch_y,
                        'is_train:0': True}  ## for dropout
                loss, _ = self.sess.run(
                    ['cross_entropy_loss:0', 'train_op'],
                    feed_dict=feed)
                avg_loss += loss

            print('Epoch %02d: Training Avg. Loss: '
                  '%7.3f' % (epoch, avg_loss), end=' ')
            if validation_set is not None:
                ## evaluate on the validation data
                ## (not on the last training batch)
                feed = {'tf_x:0': validation_set[0],
                        'tf_y:0': validation_set[1],
                        'is_train:0': False}  ## for dropout
                valid_acc = self.sess.run('accuracy:0',
                                          feed_dict=feed)
                print('Validation Acc: %7.3f' % valid_acc)
            else:
                print()

    def predict(self, X_test, return_proba=False):
        feed = {'tf_x:0': X_test,
                'is_train:0': False}  ## for dropout
        if return_proba:
            return self.sess.run('probabilities:0',
                                 feed_dict=feed)
        else:
            return self.sess.run('labels:0',
                                 feed_dict=feed)

The structure of this class is very similar to the previous section with the low-level 
TensorFlow API. The class has a constructor that sets the training parameters, 
creates a graph g, and builds the model. Besides the constructor, there are five major 
methods: 


• .build: Builds the model 
• .save: Saves a trained model 
• .load: Restores a saved model 
• .train: Trains the model 
• .predict: Makes predictions on a test set 


Similar to the implementation in the previous section, we've used a dropout layer 
after the first fully connected layer. In the previous implementation that used the 
low-level TensorFlow API, we used the tf.nn.dropout function, but here we used 
tf.layers.dropout, which is a wrapper for the tf.nn.dropout function. There are 
two major differences between these two functions that we need to be careful about: 


• tf.nn.dropout: This has an argument called keep_prob that indicates the 
  probability of keeping the units, while tf.layers.dropout has a rate 
  parameter, which is the rate of dropping units; therefore, rate = 1 - keep_prob. 
• In the tf.nn.dropout function, we fed the keep_prob parameter using a 
  placeholder so that during training, we used keep_prob=0.5. Then, 
  during the inference (or prediction) mode, we used keep_prob=1. However, in 
  tf.layers.dropout, the value of rate is provided upon the creation of the 
  dropout layer in the graph, and we cannot change it during the training or the 
  inference modes. Instead, we need to provide a Boolean argument called 
  training to determine whether we need to apply dropout or not. This can be 
  done using a placeholder of type tf.bool, which we will feed with the value 
  True during the training mode and False during the inference mode. 
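The relationship between rate, keep_prob, and the training flag can be sketched in a few lines of pure Python. This is a toy model of so-called inverted dropout on a list of activations, not the TensorFlow implementation; the function name dropout and its rng argument are hypothetical and exist only for this illustration.

```python
import random


def dropout(values, rate, training, rng=random):
    """Inverted dropout on a list of activations.

    rate     -- probability of dropping a unit (rate = 1 - keep_prob)
    training -- apply dropout only if True; otherwise return the input
    """
    if not training or rate == 0.0:
        return list(values)
    keep_prob = 1.0 - rate
    # Surviving units are scaled by 1/keep_prob so that the expected
    # activation stays the same in training and inference modes.
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]


activations = [1.0] * 8
rng = random.Random(0)
train_out = dropout(activations, rate=0.5, training=True, rng=rng)
infer_out = dropout(activations, rate=0.5, training=False)
```

In training mode every unit is either zeroed or scaled up, while in inference mode the activations pass through unchanged, which corresponds to feeding False into the training placeholder.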


We can create an instance of the ConvNN class, train it for 20 epochs, and save the 
model. The code for this is as follows: 


>>> cnn = ConvNN(random_seed=123)
>>>
>>> ## train the model
>>> cnn.train(training_set=(X_train_centered, y_train),
...           validation_set=(X_valid_centered, y_valid),
...           initialize=True)
>>> cnn.save(epoch=20)


After the training is finished, the model can be used to do prediction on the test 
dataset, as follows: 


>>> del cnn
>>>
>>> cnn2 = ConvNN(random_seed=123)
>>> cnn2.load(epoch=20, path='./tflayers-model/')
>>>
>>> print(cnn2.predict(X_test_centered[:10, :]))


Loading model from ./tflayers-model/
INFO:tensorflow:Restoring parameters from ./tflayers-model/model.ckpt-20
[7 2 1 0 4 1 4 9 5 9]


Finally, we can measure the accuracy of the test dataset as follows: 


>>> preds = cnn2.predict(X_test_centered)
>>>
>>> print('Test Accuracy: %.2f%%' % (100*
...       np.sum(y_test == preds)/len(y_test)))


Test Accuracy: 99.32% 


The obtained prediction accuracy is 99.32 percent, which means there are only 68 
misclassified test samples! 
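The count of misclassified samples follows directly from the accuracy. The short calculation below assumes the standard MNIST test set size of 10,000 samples; the variable names are chosen for this example.

```python
n_test = 10000       # assumption: the standard MNIST test set size
accuracy = 0.9932    # 99.32 percent, as reported above

n_correct = round(accuracy * n_test)
n_misclassified = n_test - n_correct
print(n_misclassified)
```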


This concludes our discussion on implementing convolutional neural networks using 
the TensorFlow low-level API and TensorFlow Layers API. We defined some 
wrapper functions for the first implementation using the low-level API. The second 
implementation was more straightforward since we could use the tf.layers.conv2d 
and tf.layers.dense functions to build the convolutional and the fully connected 
layers. 

Summary 


In this chapter, we learned about CNNs, or convolutional neural networks, and 
explored the building blocks that form different CNN architectures. We started by 
defining the convolution operation, then we learned about its fundamentals by 
discussing 1D as well as 2D implementations. 


We also covered subsampling by discussing two forms of pooling operations: max- 
pooling and average-pooling. Then, putting all these blocks together, we built a deep 
convolutional neural network and implemented it using the TensorFlow core API as 
well as the TensorFlow Layers API to apply CNNs for image classification. 


In the next chapter, we'll move on to Recurrent Neural Networks (RNNs). RNNs 
are used for learning the structure of sequence data, and they have some fascinating 
applications, including language translation and image captioning! 

Chapter 16. Modeling Sequential Data Using Recurrent Neural Networks 


In the previous chapter, we focused on Convolutional Neural Networks (CNNs) 
for image classification. In this chapter, we will explore Recurrent Neural 
Networks (RNNs) and see their application in modeling sequential data and a 
specific subset of sequential data—time-series data. As an overview, in this chapter, 
we will cover the following topics: 


• Introducing sequential data 
• RNNs for modeling sequences 
• Long Short-Term Memory (LSTM) 
• Truncated Backpropagation Through Time (T-BPTT) 
• Implementing a multilayer RNN for sequence modeling in TensorFlow 
• Project one: RNN sentiment analysis of the IMDb movie review dataset 
• Project two: RNN character-level language modeling with LSTM cells, using 
  text data from Shakespeare's Hamlet 
• Using gradient clipping to avoid exploding gradients 


Since this chapter is the last in our Python Machine Learning journey, we'll conclude 
with a summary of what we've learned about RNNs, and an overview of all the 
machine learning and deep learning topics that led us to RNNs across the journey of 
the book. We'll then sign off by sharing with you links to some of our favorite people 
and initiatives in this wonderful field so that you can continue your journey into 
machine learning and deep learning. 

Introducing sequential data 


Let's begin our discussion of RNNs by looking at the nature of sequential data, more 
commonly known as sequences. We'll take a look at the unique properties of 
sequences that make them different from other kinds of data. We'll then see how we 
can represent sequential data, and explore the various categories of models for 
sequential data, which are based on the input and output of a model. This will help us 
explore the relationship between RNNs and sequences a little bit later on in the 
chapter. 

Modeling sequential data: order matters 


What makes sequences unique compared with other data types is that the elements in 
a sequence appear in a certain order and are not independent of each other. 


If you recall from Chapter 6, Learning Best Practices for Model Evaluation and 
Hyperparameter Tuning, we discussed that typical machine learning algorithms for 
supervised learning assume that the input data is Independent and Identically 
Distributed (IID). For example, if we have n data samples, 
$x^{(1)}, x^{(2)}, \dots, x^{(n)}$, the order in which we use the data for training 
our machine learning algorithm does not matter. 


However, this assumption is not valid anymore when we deal with sequences; by 
definition, order matters. 

Representing sequences 


We've established that the elements of a sequence are ordered and not independent of 
each other; we next need to find ways to leverage this valuable information in our 
machine learning model. 


Throughout this chapter, we will represent sequences as 
$(x^{(1)}, x^{(2)}, \dots, x^{(T)})$. The superscript indices indicate the order of 
the instances, and the length of the sequence is $T$. For a sensible example of 
sequences, consider time-series data, where each sample point $x^{(t)}$ belongs to a 
particular time $t$. 


The following figure shows an example of time-series data where both x's and y's 
naturally follow the order according to their time axis; therefore, both x's and y's are 
sequences: 





The standard neural network models that we have covered so far, such as MLPs and 
CNNs, are not capable of handling the order of input samples. Intuitively, one can 
say that such models do not have a memory of previously seen samples. For instance, 
the samples are passed through the feedforward and backpropagation steps, and the 
weights are updated independent of the order in which the samples are processed. 


RNNs, by contrast, are designed for modeling sequences and are capable of 
remembering past information and processing new events accordingly. 
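The difference between an order-insensitive summary and a model with memory can be shown with a toy example. The sketch below compares a plain sum of the inputs with the final state of a simple scalar recurrence; the recurrence, its tanh activation, and the weights w_x and w_h are assumptions chosen for illustration and are not the RNN formulation developed later in this chapter.

```python
import math


def order_insensitive(seq):
    """A feature that ignores order: the plain sum of the inputs."""
    return sum(seq)


def recurrent_state(seq, w_x=1.0, w_h=0.5):
    """Final hidden state of a toy scalar recurrence.

    h_t = tanh(w_x * x_t + w_h * h_{t-1}), with h_0 = 0.
    """
    h = 0.0
    for x in seq:
        h = math.tanh(w_x * x + w_h * h)
    return h


a = [1.0, 0.0, 0.0]
b = [0.0, 0.0, 1.0]    # same elements as a, in a different order

print(order_insensitive(a), order_insensitive(b))
print(recurrent_state(a), recurrent_state(b))
```

The sums are identical, but the recurrent states differ: in sequence a the early input has largely faded by the last step, whereas in sequence b it arrives last and dominates the final state.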
The different categories of sequence modeling 


Sequence modeling has many fascinating applications, such as language translation 
(perhaps from English to German), image captioning, and text generation. 


However, we need to understand the different types of sequence modeling tasks to 
develop an appropriate model. The following figure, based on the explanations in the 
excellent article The Unreasonable Effectiveness of Recurrent Neural Networks by 
Andrej Karpathy (http://karpathy.github.io/2015/05/21/rnn-effectiveness/), shows 
several different relationship categories of input and output data: 


[Figure: categories of sequence modeling; the surviving panel titles are many-to-one, many-to-many, and many-to-many]




So, let's consider the input and output data here. If neither the input nor the output 
data represents sequences, then we are dealing with standard data, and we can use any of 
the previous methods to model such data. But if either the input or output is a 
sequence, the data will form one of the following three different categories: 


• Many-to-one: The input data is a sequence, but the output is a fixed-size 
  vector, not a sequence. For example, in sentiment analysis, the input is 
  text-based and the output is a class label. 
• One-to-many: The input data is in standard format, not a sequence, but the 
  output is a sequence. An example of this category is image captioning: the 
  input is an image; the output is an English phrase. 
• Many-to-many: Both the input and output arrays are sequences. This category 
  can be further divided based on whether the input and output are synchronized 
  or not. An example of a synchronized many-to-many modeling task is video 
  classification, where each frame in a video is labeled. An example of a delayed 
  many-to-many task would be translating one language into another. For instance, 
  an entire English sentence must be read and processed by a machine before 
  producing its translation into German. 
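The categories above depend only on whether the input and the output are sequences. The small sketch below encodes that decision as a function; the function name sequence_task_category and the example dictionary are hypothetical and serve only to restate the list in executable form.

```python
def sequence_task_category(input_is_sequence, output_is_sequence):
    """Name the modeling category from the input/output types."""
    if input_is_sequence and output_is_sequence:
        return 'many-to-many'
    if input_is_sequence:
        return 'many-to-one'
    if output_is_sequence:
        return 'one-to-many'
    return 'standard (non-sequential)'


examples = {
    'sentiment analysis': sequence_task_category(True, False),
    'image captioning': sequence_task_category(False, True),
    'language translation': sequence_task_category(True, True),
}
print(examples)
```

Note that this simple rule does not distinguish synchronized from delayed many-to-many tasks; that distinction depends on when the outputs are produced relative to the inputs.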


Now that we know about the categories of sequence modeling, we can move 
forward to discuss the structure of an RNN. 

RNNs for modeling sequences 


In this section, now that we understand sequences, we can look at the foundations of 
RNNs. We'll start by introducing the typical structure of an RNN, and we'll see how 
the data flows through it with one or more hidden layers. We'll then examine how 
the neuron activations are computed in a typical RNN. This will create a context for 
us to discuss the common challenges in training RNNs, and explore the modern 
solution to these challenges: LSTM. 

Understanding the structure and flow of an RNN 


Let's start by introducing the architecture of an RNN. The following figure shows a 
standard feedforward neural network and an RNN side by side for comparison:

[Figure: a standard feedforward neural network next to a recurrent neural network]





Both of these networks have only one hidden layer. In this representation, the units 
are not displayed, but we assume that the input layer (x), hidden layer (h), and output 
layer (y) are vectors which contain many units. 


Note 


This generic RNN architecture could correspond to the two sequence modeling
categories where the input is a sequence. Thus, it could be either many-to-many if
we consider $y^{(t)}$ as the final output, or it could be many-to-one if, for
example, we only use the last element of $y^{(t)}$ as the final output.

Later, we will see how the output sequence $y^{(t)}$ can be converted into standard,
nonsequential output.


In a standard feedforward network, information flows from the input to the hidden
layer, and then from the hidden layer to the output layer. On the other hand, in a
recurrent network, the hidden layer gets its input from both the input layer and the
hidden layer from the previous time step.


The flow of information in adjacent time steps in the hidden layer allows the network 
to have a memory of past events. This flow of information is usually displayed as a
loop, also known as a recurrent edge in graph notation, which is how this general 
architecture got its name. 


In the following figure, the single hidden layer network and the multilayer network 
illustrate two contrasting architectures: 









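[Figure: a single-layer RNN and a multilayer RNN, each shown in compact form with a
recurrent edge and unfolded across time steps]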
































In order to examine the architecture of RNNs and the flow of information, a compact
representation with a recurrent edge can be unfolded, which you can see in the
preceding figure.


As we know, each hidden unit in a standard neural network receives only one input:
the net preactivation associated with the input layer. Now, in contrast, each hidden
unit in an RNN receives two distinct sets of input: the preactivation from the input
layer and the activation of the same hidden layer from the previous time step $t-1$.


At the first time step $t = 0$, the hidden units are initialized to zeros or small
random values. Then, at a time step where $t > 0$, the hidden units get their input
from the data point at the current time $x^{(t)}$ and the previous values of hidden
units at $t - 1$, indicated as $h^{(t-1)}$.


Similarly, in the case of a multilayer RNN, we can summarize the information flow 
as follows: 


• layer = 1: Here, the hidden layer is represented as $h_1^{(t)}$ and gets its input
from the data point $x^{(t)}$ and the hidden values in the same layer, but the
previous time step, $h_1^{(t-1)}$

• layer = 2: The second hidden layer, $h_2^{(t)}$, receives its inputs from the hidden
units from the layer below at the current time step ($h_1^{(t)}$) and its own hidden
values from the previous time step, $h_2^{(t-1)}$

Computing activations in an RNN


Now that we understand the structure and general flow of information in an RNN, 
let's get more specific and compute the actual activations of the hidden layers as well 
as the output layer. For simplicity, we'll consider just a single hidden layer; however, 
the same concept applies to multilayer RNNs. 


Each directed edge (the connections between boxes) in the representation of an RNN
that we just looked at is associated with a weight matrix. Those weights do not
depend on time $t$; therefore, they are shared across the time axis. The different
weight matrices in a single layer RNN are as follows:


• $W_{xh}$: The weight matrix between the input $x^{(t)}$ and the hidden layer $h$

• $W_{hh}$: The weight matrix associated with the recurrent edge

• $W_{hy}$: The weight matrix between the hidden layer and output layer


You can see these weight matrices in the following figure: 





[Figure: the RNN with its weight matrices $W_{xh}$, $W_{hh}$, and $W_{hy}$ labeled on
the corresponding edges]





In certain implementations, you may observe that weight matrices $W_{xh}$ and
$W_{hh}$ are concatenated to a combined matrix $W_h = [W_{xh}; W_{hh}]$. Later on,
we'll make use of this notation as well.


Computing the activations is very similar to standard multilayer perceptrons and
other types of feedforward neural networks. For the hidden layer, the net input
$z_h$ (preactivation) is computed through a linear combination. That is, we compute
the sum of the multiplications of the weight matrices with the corresponding vectors
and add the bias unit:

$$z_h^{(t)} = W_{xh} x^{(t)} + W_{hh} h^{(t-1)} + b_h$$

Then, the activations of the hidden units at the time step $t$ are calculated as
follows:

$$h^{(t)} = \phi_h\left(z_h^{(t)}\right) = \phi_h\left(W_{xh} x^{(t)} + W_{hh} h^{(t-1)} + b_h\right)$$

Here, $b_h$ is the bias vector for the hidden units and $\phi_h(\cdot)$ is the
activation function of the hidden layer.


In case you want to use the concatenated weight matrix $W_h = [W_{xh}; W_{hh}]$, the
formula for computing hidden units will change as follows:

$$h^{(t)} = \phi_h\left([W_{xh}; W_{hh}] \begin{bmatrix} x^{(t)} \\ h^{(t-1)} \end{bmatrix} + b_h\right)$$





Once the activations of hidden units at the current time step are computed, then the
activations of output units will be computed as follows:

$$y^{(t)} = \phi_y\left(W_{hy} h^{(t)} + b_y\right)$$
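The formulas above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the TensorFlow implementation used later: the function name rnn_forward, the toy dimensions, the tanh hidden activation, and the identity output activation are all choices made for this example. The last lines check that the separate-matrix and concatenated-matrix formulations give the same hidden activation.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a single-layer RNN over x_seq of shape (T, n_in).

    Assumptions for this sketch: tanh as the hidden activation and the
    identity as the output activation.
    """
    h = np.zeros(W_hh.shape[0])            # hidden units start at zero
    hs, ys = [], []
    for x_t in x_seq:
        z_h = W_xh @ x_t + W_hh @ h + b_h  # net input (preactivation)
        h = np.tanh(z_h)                   # hidden activation h^(t)
        hs.append(h)
        ys.append(W_hy @ h + b_y)          # output y^(t)
    return np.array(hs), np.array(ys)

rng = np.random.RandomState(0)
n_in, n_hid, n_out, T = 3, 4, 2, 5         # hypothetical toy sizes
W_xh = rng.normal(scale=0.5, size=(n_hid, n_in))
W_hh = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_hy = rng.normal(scale=0.5, size=(n_out, n_hid))
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)
x_seq = rng.normal(size=(T, n_in))

hs, ys = rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y)

# Concatenated formulation for the last time step: W_h [x; h_prev] + b_h
W_h = np.hstack([W_xh, W_hh])              # shape (n_hid, n_in + n_hid)
h_last_concat = np.tanh(W_h @ np.concatenate([x_seq[-1], hs[-2]]) + b_h)
```

Note that the same three weight matrices are reused at every time step, which is what "shared across the time axis" means in practice.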


To help clarify this further, the following figure shows the process of computing
these activations with both formulations:

[Figure: computing the activations. Formulation 1:
$h^{(t)} = \phi_h(W_{xh} x^{(t)} + W_{hh} h^{(t-1)} + b_h)$; final output:
$y^{(t)} = \phi_y(W_{hy} h^{(t)} + b_y)$]





Note 
Training RNNs using BPTT 


The learning algorithm for RNNs was introduced in 1990 in Backpropagation
Through Time: What It Does and How to Do It (Paul Werbos, Proceedings of IEEE,
78(10):1550-1560, 1990).

The derivation of the gradients might be a bit complicated, but the basic idea is that
the overall loss $L$ is the sum of all the loss functions at times $t = 1$ to $t = T$:

$$L = \sum_{t=1}^{T} L^{(t)}$$

Since the loss at time $t$ is dependent on the hidden units at all previous time steps
$1 : t$, the gradient will be computed as follows:

$$\frac{\partial L^{(t)}}{\partial W_{hh}} = \frac{\partial L^{(t)}}{\partial y^{(t)}} \times \frac{\partial y^{(t)}}{\partial h^{(t)}} \times \left( \sum_{k=1}^{t} \frac{\partial h^{(t)}}{\partial h^{(k)}} \times \frac{\partial h^{(k)}}{\partial W_{hh}} \right)$$

Here, $\frac{\partial h^{(t)}}{\partial h^{(k)}}$ is computed as a multiplication of
adjacent time steps:

$$\frac{\partial h^{(t)}}{\partial h^{(k)}} = \prod_{i=k+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}}$$

The challenges of learning long-range interactions


Backpropagation through time, or BPTT, which we briefly mentioned in the 
previous information box, introduces some new challenges. 


Because of the multiplicative factor $\frac{\partial h^{(t)}}{\partial h^{(k)}}$ in the
computing gradients of a loss function, the so-called vanishing or exploding gradient
problem arises. This problem is explained through the examples in the following
figure, which shows an RNN with only one hidden unit for simplicity:


[Figure: an RNN with a single hidden unit. Vanishing gradient: $|w_{hh}| < 1$;
exploding gradient: $|w_{hh}| > 1$; desirable: $|w_{hh}| = 1$]





Basically, $\frac{\partial h^{(t)}}{\partial h^{(k)}}$ has $t - k$ multiplications;
therefore, multiplying the weight $w$ by itself $t - k$ times results in a factor
$w^{t-k}$. As a result, if $|w| < 1$, this factor becomes very small when $t - k$ is
large. On the other hand, if the weight of the recurrent edge is $|w| > 1$, then
$w^{t-k}$ becomes very large when $t - k$ is large. Note that large $t - k$ refers to
long-range dependencies.

Intuitively, we can see that a naive solution to avoid vanishing or exploding gradient
can be accomplished by ensuring $|w| = 1$. If you are interested and would like to
investigate this in more detail, I encourage you to read On the difficulty of training
recurrent neural networks by R. Pascanu, T. Mikolov, and Y. Bengio, 2012
(https://arxiv.org/pdf/1211.5063.pdf).
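A quick numeric check makes the effect of the factor $w^{t-k}$ tangible. The sketch below is only an illustration for a scalar recurrent weight; the gap of 50 time steps and the three weight values are arbitrary choices for this example.

```python
# Scalar recurrent weight w multiplied by itself (t - k) times.
steps = 50  # hypothetical gap t - k between two time steps

factors = {w: w ** steps for w in (0.9, 1.0, 1.1)}
for w, factor in factors.items():
    print(f"|w| = {w}: w^(t-k) = {factor:.3e}")
```

Even weights close to 1 shrink or blow up the gradient contribution noticeably over a few dozen steps, which is why long-range dependencies are hard to learn with a plain RNN.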


In practice, there are two solutions to this problem: 


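• Truncated backpropagation through time (TBPTT)
• Long short-term memory (LSTM)


TBPTT limits the number of time steps that the error signal is backpropagated
through after each forward pass; clipping gradients above a given threshold
(gradient clipping) is a separate technique that is often used alongside it.
While TBPTT can mitigate the exploding gradient problem, the truncation limits
the number of steps that the gradient can effectively flow back and properly
update the weights.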


On the other hand, LSTM, designed in 1997 by Hochreiter and Schmidhuber, has
been more successful in modeling long-range sequences by overcoming the
vanishing gradient problem. Let's discuss LSTM in more detail.

LSTM units


LSTMs were first introduced to overcome the vanishing gradient problem (Long 
Short-Term Memory, S. Hochreiter and J. Schmidhuber, Neural Computation, 9(8): 
1735-1780, 1997). The building block of an LSTM is a memory cell, which 
essentially represents the hidden layer. 


In each memory cell, there is a recurrent edge that has the desirable weight $w = 1$,
as we discussed previously, to overcome the vanishing and exploding gradient
problems. The values associated with this recurrent edge are called the cell state.
The unfolded structure of a modern LSTM cell is shown in the following figure:





Notice that the cell state from the previous time step, $C^{(t-1)}$, is modified to
get the cell state at the current time step, $C^{(t)}$, without being multiplied
directly with any weight factor.


The flow of information in this memory cell is controlled by some units of
computation that we'll describe here. In the previous figure, $\odot$ refers to the
element-wise product (element-wise multiplication) and $\oplus$ means element-wise
summation (element-wise addition). Furthermore, $x^{(t)}$ refers to the input data at
time $t$, and $h^{(t-1)}$ indicates the hidden units at time $t-1$.


Four boxes are indicated with an activation function, either the sigmoid function
($\sigma$) or hyperbolic tangent (tanh), and a set of weights; these boxes apply linear
combination by performing matrix-vector multiplications on their input. These units
of computation with sigmoid activation functions, whose output units are passed
through $\odot$, are called gates.


In an LSTM cell, there are three different types of gates, known as the forget gate, 
the input gate, and the output gate: 


• The forget gate ($f_t$) allows the memory cell to reset the cell state without
growing indefinitely. In fact, the forget gate decides which information is
allowed to go through and which information to suppress. Now, $f_t$ is computed
as follows:

$$f_t = \sigma\left(W_{xf} x^{(t)} + W_{hf} h^{(t-1)} + b_f\right)$$

Note that the forget gate was not part of the original LSTM cell; it was added a
few years later to improve the original model (Learning to Forget: Continual
Prediction with LSTM, F. Gers, J. Schmidhuber, and F. Cummins, Neural
Computation 12, 2451-2471, 2000).

• The input gate ($i_t$) and input node ($g_t$) are responsible for updating the cell
state. They are computed as follows:

$$i_t = \sigma\left(W_{xi} x^{(t)} + W_{hi} h^{(t-1)} + b_i\right)$$

$$g_t = \tanh\left(W_{xg} x^{(t)} + W_{hg} h^{(t-1)} + b_g\right)$$

The cell state at time $t$ is computed as follows:

$$C^{(t)} = \left(C^{(t-1)} \odot f_t\right) \oplus \left(i_t \odot g_t\right)$$

• The output gate ($o_t$) decides how to update the values of hidden units:

$$o_t = \sigma\left(W_{xo} x^{(t)} + W_{ho} h^{(t-1)} + b_o\right)$$

Given this, the hidden units at the current time step are computed as follows:

$$h^{(t)} = o_t \odot \tanh\left(C^{(t)}\right)$$


The structure of an LSTM cell and its underlying computations might seem too
complex. However, the good news is that TensorFlow has already implemented
everything in wrapper functions that allow us to define our LSTM cells easily. We'll
see the real application of LSTMs in action when we use TensorFlow later in this
chapter.
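To make the gate computations concrete, a single LSTM time step can be sketched in NumPy. This is an illustrative sketch, not how TensorFlow implements its cells: the function names sigmoid and lstm_step, the dictionary layout of the weights, and the toy sizes are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps a gate name ('f', 'i', 'g', 'o') to a pair (W_x, W_h) and b maps
    it to a bias vector; this layout is an assumption of the sketch.
    """
    def pre(name):
        W_x, W_h = W[name]
        return W_x @ x_t + W_h @ h_prev + b[name]

    f_t = sigmoid(pre('f'))           # forget gate
    i_t = sigmoid(pre('i'))           # input gate
    g_t = np.tanh(pre('g'))           # input node
    o_t = sigmoid(pre('o'))           # output gate
    c_t = c_prev * f_t + i_t * g_t    # new cell state (element-wise ops)
    h_t = o_t * np.tanh(c_t)          # new hidden units
    return h_t, c_t

rng = np.random.RandomState(1)
n_in, n_hid = 3, 4                    # hypothetical toy sizes
W = {k: (rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid)))
     for k in 'figo'}
b = {k: np.zeros(n_hid) for k in 'figo'}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(6, n_in)):   # a toy sequence of 6 time steps
    h, c = lstm_step(x_t, h, c, W, b)
```

The cell state is only scaled by the forget gate and shifted by the gated input node; it is never multiplied by a weight matrix, which is the property highlighted in the text.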


Note 


We have introduced LSTMs in this section, which provide a basic approach for
modeling long-range dependencies in sequences. Yet, it is important to note that
there are many variations of LSTMs described in the literature (An Empirical
Exploration of Recurrent Network Architectures, Rafal Jozefowicz, Wojciech
Zaremba, and Ilya Sutskever, Proceedings of ICML, 2342-2350, 2015).


Also worth noting is a more recent approach, called Gated Recurrent Unit (GRU),
which was proposed in 2014. GRUs have a simpler architecture than LSTMs;
therefore, they are computationally more efficient while their performance in some
tasks, such as polyphonic music modeling, is comparable to LSTMs. If you are
interested in learning more about these modern RNN architectures, refer to the paper
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
by Junyoung Chung and others, 2014 (https://arxiv.org/pdf/1412.3555v1.pdf).

Implementing a multilayer RNN for sequence modeling in TensorFlow


Now that we have introduced the underlying theory behind RNNs, we are ready to move
on to the more practical part and implement RNNs in TensorFlow. During the rest of
this chapter, we will apply RNNs to two common tasks:


1. Sentiment analysis 
2. Language modeling 


These two projects, which we'll build together in the following pages, are both
fascinating but also quite involved. Thus, instead of providing all the code at
once, we will break the implementation up into several steps and discuss the code in
detail. If you would like a big-picture overview and want to see all the code at once
before diving into the discussion, we recommend that you take a look at the code
implementation first, which you can view at
https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch16/ch16.ipynb.


Note, before we start coding in this chapter, that since we're using a very modern
build of TensorFlow, we'll be using code from the contrib submodule of
TensorFlow's Python API, in the latest version of TensorFlow (1.3.0) from August
2017. These contrib functions and classes, as well as their documentation
references used in this chapter, may change in the future versions of TensorFlow, or
they may be integrated into the tf.nn submodule. We therefore advise you to keep
an eye on the TensorFlow API documentation
(https://www.tensorflow.org/api_docs/python/) to be updated with the latest version
details, in particular, if you have any problems using the tf.contrib code described
in this chapter.

Project one — performing sentiment analysis of IMDb movie reviews using
multilayer RNNs


You may recall from Chapter 8, Applying Machine Learning to Sentiment Analysis, 
that sentiment analysis is concerned with analyzing the expressed opinion of a
sentence or a text document. In this section and the following subsections, we will 
implement a multilayer RNN for sentiment analysis using a many-to-one 
architecture. 


In the next section, we will implement a many-to-many RNN for an application of
language modeling. While the chosen examples are purposefully simple to introduce
the main concepts of RNNs, language modeling has a wide range of interesting
applications, such as building chatbots that give computers the ability to talk
and interact with a human directly.

Preparing the data


In the preprocessing steps in Chapter 8, Applying Machine Learning to Sentiment
Analysis, we created a clean dataset named movie_data.csv, which we'll use again
now. So, first let's import the necessary modules and read the data into a pandas
DataFrame, as follows:


>>> import pyprind
>>> import pandas as pd
>>> from string import punctuation
>>> import re
>>> import numpy as np
>>>
>>> df = pd.read_csv('movie_data.csv', encoding='utf-8')


Recall that this df DataFrame has two columns, namely 'review' and 'sentiment',
where 'review' contains the text of movie reviews and 'sentiment' contains the 0
or 1 labels. The text component of each movie review is a sequence of words;
therefore, we want to build an RNN model to process the words in each sequence,
and at the end, classify the entire sequence as class 0 or 1.


To prepare the data for input to a neural network, we need to encode it into numeric
values. To do this, we first find the unique words in the entire dataset, which can be
done using sets in Python. However, I found that using sets for finding unique words
in such a large dataset is not efficient. A more efficient way is to use Counter from
the collections package. If you want to learn more about Counter, refer to its
documentation at
https://docs.python.org/3/library/collections.html#collections.Counter.


In the following code, we will define a counts object from the Counter class that
collects the counts of occurrence of each unique word in the text. Note that in this 
particular application (and in contrast to the bag-of-words model), we are only 
interested in the set of unique words and won't require the word counts, which are 
created as a side product. 


Then, we create a mapping in the form of a dictionary that maps each unique word
in our dataset to a unique integer number. We call this dictionary word_to_int,
which can be used to convert the entire text of a review into a list of numbers. The
unique words are sorted based on their counts, but any arbitrary order can be used
without affecting the final results. This process of converting a text into a list of
integers is performed using the following code:


>>> ## Preprocessing the data:
>>> ## Separate words and
>>> ## count each word's occurrence
>>>
>>> from collections import Counter
>>>
>>> counts = Counter()
>>> pbar = pyprind.ProgBar(len(df['review']),
...                        title='Counting words occurrences')
>>> for i, review in enumerate(df['review']):
...     ## surround punctuation with spaces so that it becomes its own token
...     text = ''.join([c if c not in punctuation else ' '+c+' ' \
...                     for c in review]).lower()
...     df.loc[i, 'review'] = text
...     pbar.update()
...     counts.update(text.split())
>>>
>>> ## Create a mapping
>>> ## Map each unique word to an integer
>>> word_counts = sorted(counts, key=counts.get, reverse=True)
>>> print(word_counts[:5])
>>> word_to_int = {word: ii for ii, word in \
...                enumerate(word_counts, 1)}
>>>
>>> mapped_reviews = []
>>> pbar = pyprind.ProgBar(len(df['review']),
...                        title='Map reviews to ints')
>>> for review in df['review']:
...     mapped_reviews.append([word_to_int[word] \
...                            for word in review.split()])
...     pbar.update()
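The same preprocessing can be tried on a tiny example before running it on the full dataset. The following sketch uses two made-up reviews (toy_reviews is hypothetical data, not part of the IMDb set) and omits the progress bar; it shows how punctuation becomes its own token and how the integer mapping starts at 1 so that 0 stays free for padding.

```python
from collections import Counter
from string import punctuation

toy_reviews = ["Great movie, great cast!", "Bad movie."]  # hypothetical data

counts = Counter()
cleaned = []
for review in toy_reviews:
    # Surround punctuation with spaces so that it becomes its own token
    text = ''.join(c if c not in punctuation else ' ' + c + ' '
                   for c in review).lower()
    cleaned.append(text)
    counts.update(text.split())

# The most frequent words get the smallest integers; index 0 is left unused
word_counts = sorted(counts, key=counts.get, reverse=True)
word_to_int = {word: ii for ii, word in enumerate(word_counts, 1)}
mapped = [[word_to_int[w] for w in text.split()] for text in cleaned]
```

Each review is now a list of integers whose length equals its number of tokens, so the lists still differ in length.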


So far, we've converted sequences of words into sequences of integers. However, 
there is one issue that we still need to solve—the sequences currently have different 
lengths. In order to generate input data that is compatible with our RNN architecture,
we will need to make sure that all the sequences have the same length. 


For this purpose, we define a parameter called sequence_length that we set to 200.
Sequences that have fewer than 200 words will be left-padded with zeros. Conversely,
sequences that are longer than 200 words are cut such that only the last 200
corresponding words will be used. We can implement this preprocessing step in two
steps:


1. Create a matrix of zeros, where each row corresponds to a sequence of size 200.
2. Fill the index of words in each sequence from the right-hand side of the matrix.
Thus, if a sequence has a length of 150, the first 50 elements of the
corresponding row will stay zero.


These two steps are shown in the following figure, for a small example with eight 
sequences of sizes 4, 12, 8, 11, 7, 3, 10, and 13: 


[Figure: three panels titled "A matrix of all zeros", "Extract indices of unique
words", and "Filling sequences from the right side"]





Note that sequence_length is, in fact, a hyperparameter and can be tuned for
optimal performance. Due to page limitations, we did not optimize this
hyperparameter further, but we encourage you to try this with different values for
sequence_length, such as 50, 100, 200, 250, and 300.


Check out the following code for the implementation of these steps to create
sequences of the same length:


>>> ## Define same-length sequences
>>> ## if sequence length < 200: left-pad with zeros
>>> ## if sequence length > 200: use the last 200 elements
>>>
>>> sequence_length = 200  ## (Known as T in our RNN formulas)
>>> sequences = np.zeros((len(mapped_reviews), sequence_length),
...                      dtype=int)
>>>
>>> for i, row in enumerate(mapped_reviews):
...     review_arr = np.array(row)
...     sequences[i, -len(row):] = review_arr[-sequence_length:]
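The slicing trick in the loop handles both cases at once, which is easier to see on a toy input. In this sketch, toy_mapped and toy_length are hypothetical stand-ins for mapped_reviews and sequence_length; the first row is shorter than the target length and the second one is longer.

```python
import numpy as np

toy_mapped = [[21, 88, 19, 26], [4, 8, 11, 25, 23, 42, 84]]  # hypothetical
toy_length = 5

toy_sequences = np.zeros((len(toy_mapped), toy_length), dtype=int)
for i, row in enumerate(toy_mapped):
    review_arr = np.array(row)
    # Negative slices keep the last toy_length items and right-align them
    toy_sequences[i, -len(row):] = review_arr[-toy_length:]
```

The short row ends up left-padded with a zero, while the long row keeps only its last five entries. One caveat of this approach is that an empty row would raise an error, because `-len(row)` is then 0 and the slice selects the whole row.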


After we preprocess the dataset, we can proceed with splitting the data into separate 
training and test sets. Since the dataset was already shuffled, we can simply take the 
first half of the dataset for training and the second half for testing, as follows: 


>>> X_train = sequences[:25000, :]
>>> ## .loc slices include the end label, so stop at 24999
>>> y_train = df.loc[:24999, 'sentiment'].values
>>> X_test = sequences[25000:, :]
>>> y_test = df.loc[25000:, 'sentiment'].values


Now if we want to separate the dataset for cross-validation, we can split the
second half of the data further to generate a smaller test set and a validation set for
hyperparameter optimization.


Finally, we define a helper function that breaks a given dataset (which could be a 
training set or test set) into chunks and returns a generator to iterate through these 
chunks (also known as mini-batches):


>>> np.random.seed(123) # for reproducibility
>>>
>>> ## Define a function to generate mini-batches:
>>> def create_batch_generator(x, y=None, batch_size=64):
...     n_batches = len(x)//batch_size
...     ## drop the samples that do not fill a complete batch
...     x = x[:n_batches*batch_size]
...     if y is not None:
...         y = y[:n_batches*batch_size]
...     for ii in range(0, len(x), batch_size):
...         if y is not None:
...             yield x[ii:ii+batch_size], y[ii:ii+batch_size]
...         else:
...             yield x[ii:ii+batch_size]

Using generators, as we've done in this code, is a very useful technique for handling
memory limitations. This is the recommended approach for splitting the dataset into
mini-batches for training a neural network, rather than creating all the data splits
upfront and keeping them in memory during training.

Embedding


During the data preparation in the previous step, we generated sequences of the same 
length. The elements of these sequences were integer numbers that corresponded to 
the indices of unique words. 


These word indices can be converted into input features in several different ways. 
One naive way is to apply one-hot encoding to convert indices into vectors of zeros 
and ones. Then, each word will be mapped to a vector whose size is the number of 
unique words in the entire dataset. Given that the number of unique words (the size 
of the vocabulary) can be in the order of 20,000, which will also be the number of 
our input features, a model trained on such features may suffer from the curse of 
dimensionality. Furthermore, these features are very sparse, since all are zero except 
one. 


A more elegant way is to map each word to a vector of fixed size with real-valued
elements (not necessarily integers). In contrast to the one-hot encoded vectors, we
can use finite-sized vectors to represent an infinite number of real numbers (in
theory, we can extract infinitely many real numbers from a given interval, for
example [-1, 1]).


This is the idea behind the so-called embedding, which is a feature-learning 
technique that we can utilize here to automatically learn the salient features to 
represent the words in our dataset. Given the number of unique words, unique_words,
we can choose the size of the embedding vectors to be much smaller than the number
of unique words (embedding_size << unique_words) to represent the entire
vocabulary as input features. 


The advantages of embedding over one-hot encoding are as follows: 


• A reduction in the dimensionality of the feature space to decrease the effect of
the curse of dimensionality

• The extraction of salient features since the embedding layer in a neural network
is trainable


The following schematic representation shows how embedding works by mapping
vocabulary indices to a trainable embedding matrix:

[Figure: vocabulary indices mapped to the rows of a trainable real-valued matrix;
rows: number of unique words, columns: number of features (embedding size)]





TensorFlow implements an efficient function, tf.nn.embedding_lookup, that maps
each integer that corresponds to a unique word to a row of this trainable matrix.
The lookup is zero-based: integer 0 (which we reserved for padding) is mapped to the
first row, integer 1 is mapped to the second row, and so on. Then, given a sequence
of integers, such as <0, 5, 3, 4, 19, 2...>, we need to look up the corresponding
rows for each element of this sequence.
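Conceptually, an embedding lookup is just row indexing, and it gives the same result as multiplying one-hot vectors with the embedding matrix. The NumPy sketch below illustrates this equivalence; it does not use TensorFlow, and the toy sizes and the index sequence are hypothetical values for this example.

```python
import numpy as np

rng = np.random.RandomState(123)
n_words, embedding_size = 10, 4             # hypothetical toy sizes
embedding = rng.uniform(-1, 1, size=(n_words, embedding_size))

seq = np.array([0, 5, 3, 4, 9, 2])          # a sequence of word indices
embedded = embedding[seq]                   # row lookup, shape (6, 4)

# The same result via explicit one-hot encoding and a matrix product
one_hot = np.eye(n_words)[seq]              # shape (6, 10)
embedded_via_one_hot = one_hot @ embedding
```

The lookup avoids building the large, sparse one-hot matrix altogether, which is why it is the preferred formulation when the vocabulary is large.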


Now let's see how we can create an embedding layer in practice. If we have tf_x as
the input layer where the corresponding vocabulary indices are fed with type 
tf.int32, then creating an embedding layer can be done in two steps, as follows: 


1. We start by creating a matrix of size [n_words x embedding_size] as a
tensor variable, which we call embedding, and we initialize its elements
randomly with floats between [-1, 1]:

embedding = tf.Variable(
    tf.random_uniform(
        shape=(n_words, embedding_size),
        minval=-1, maxval=1)
)

2. Then, we use the tf.nn.embedding_lookup function to look up the row in the
embedding matrix associated with each element of tf_x:

embed_x = tf.nn.embedding_lookup(embedding, tf_x)


Note 


As you may have observed in these steps, to create an embedding layer, the
tf.nn.embedding_lookup function requires two arguments: the embedding tensor
and the lookup IDs.


The tf.nn.embedding_lookup function has a few optional arguments that allow you
to tweak the behavior of the embedding layer, such as applying L2 normalization.
Feel free to read more about this function from its official documentation at
https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup.

Building an RNN model


Now we're ready to build an RNN model. We'll implement a SentimentRNN class
that has the following methods: 


• A constructor to set all the model parameters and then create a computation
graph and call the self.build method to build the multilayer RNN model.

• A build method that declares three placeholders for input data, input labels,
and the keep-probability for the dropout configuration of the hidden layer. After
declaring these, it creates an embedding layer, and builds the multilayer RNN
using the embedded representation as input.

• A train method that creates a TensorFlow session for launching the
computation graph, iterates through the mini-batches of data, and runs for a
fixed number of epochs, to minimize the cost function defined in the graph.
This method also saves the model after 10 epochs for checkpointing.

• A predict method that creates a new session, restores the last checkpoint saved
during the training process, and carries out the predictions for the test data.


In the following code, we'll see the implementation of this class and its methods
broken into separate code sections.

The SentimentRNN class constructor


Let's start with the constructor of our SentimentRNN class, which we'll code as 
follows: 


import tensorflow as tf


class SentimentRNN(object):
    def __init__(self, n_words, seq_len=200,
                 lstm_size=256, num_layers=1, batch_size=64,
                 learning_rate=0.0001, embed_size=200):
        self.n_words = n_words
        self.seq_len = seq_len
        self.lstm_size = lstm_size   ## number of hidden units
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.embed_size = embed_size

        self.g = tf.Graph()
        with self.g.as_default():
            tf.set_random_seed(123)
            self.build()
            self.saver = tf.train.Saver()
            self.init_op = tf.global_variables_initializer()


Here, the n_words parameter must be set equal to the number of unique words (plus
1 since we use zero to fill sequences whose size is less than 200) and it's used while
creating the embedding layer along with the embed_size hyperparameter.
Meanwhile, the seq_len variable must be set according to the length of the
sequences that were created in the preprocessing steps we went through previously.
Note that lstm_size is another hyperparameter that we've used here, and it
determines the number of hidden units in each RNN layer.
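As a small illustration of the "plus 1" rule, n_words can be derived directly from the word-to-integer mapping. The dictionary below is a hypothetical toy mapping standing in for the word_to_int dictionary built during preprocessing.

```python
word_to_int = {'great': 1, 'movie': 2, 'bad': 3}   # hypothetical toy mapping

# Word indices start at 1, so the embedding matrix needs one extra row
# for the padding index 0.
n_words = max(word_to_int.values()) + 1
```

With this value, every integer that can appear in the padded sequences, including 0, corresponds to a valid row of the embedding matrix.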
The build method


Next, let's discuss the build method for our SentimentRNN class. This is the longest
and most critical method in our sequence, so we'll be going through it in plenty of 
detail. First, we'll look at the code in full, so we can see everything together, and 
then we'll analyze each of its main parts: 


    def build(self):
        ## Define the placeholders
        tf_x = tf.placeholder(tf.int32,
                    shape=(self.batch_size, self.seq_len),
                    name='tf_x')
        tf_y = tf.placeholder(tf.float32,
                    shape=(self.batch_size),
                    name='tf_y')
        tf_keepprob = tf.placeholder(tf.float32,
                    name='tf_keepprob')

        ## Create the embedding layer
        embedding = tf.Variable(
                    tf.random_uniform(
                        (self.n_words, self.embed_size),
                        minval=-1, maxval=1),
                    name='embedding')
        embed_x = tf.nn.embedding_lookup(
                    embedding, tf_x,
                    name='embeded_x')

        ## Define LSTM cell and stack them together
        cells = tf.contrib.rnn.MultiRNNCell(
                [tf.contrib.rnn.DropoutWrapper(
                   tf.contrib.rnn.BasicLSTMCell(self.lstm_size),
                   output_keep_prob=tf_keepprob)
                 for i in range(self.num_layers)])

        ## Define the initial state:
        self.initial_state = cells.zero_state(
                 self.batch_size, tf.float32)
        print('  << initial state >> ', self.initial_state)

        lstm_outputs, self.final_state = tf.nn.dynamic_rnn(
                 cells, embed_x,
                 initial_state=self.initial_state)

        ## Note: lstm_outputs shape:
        ##  [batch_size, max_time, cells.output_size]
        print('\n  << lstm_output   >> ', lstm_outputs)
        print('\n  << final state   >> ', self.final_state)

        ## Use only the output of the last time step (many-to-one)
        logits = tf.layers.dense(
                 inputs=lstm_outputs[:, -1],
                 units=1, activation=None,
                 name='logits')

        logits = tf.squeeze(logits, name='logits_squeezed')
        print('\n  << logits        >> ', logits)

        y_proba = tf.nn.sigmoid(logits, name='probabilities')
        predictions = {
            'probabilities': y_proba,
            'labels': tf.cast(tf.round(y_proba), tf.int32,
                 name='labels')
        }
        print('\n  << predictions   >> ', predictions)

        ## Define the cost function
        cost = tf.reduce_mean(
                 tf.nn.sigmoid_cross_entropy_with_logits(
                 labels=tf_y, logits=logits),
                 name='cost')

        ## Define the optimizer
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        train_op = optimizer.minimize(cost, name='train_op')


So first of all in our build method here, we created three placeholders, namely tf_x,
tf_y, and tf_keepprob, which we need for feeding the input data. Then we added
the embedding layer, which builds the embedded representation embed_x, as we
discussed earlier.


Next, in our build method, we built the RNN network with LSTM cells. We did this 
in three steps: 


1. First, we defined the multilayer RNN cells. 
2. Next, we defined the initial state for these cells. 
3. Finally, we created an RNN specified by the RNN cells and their initial states. 


Let's break these three steps out in detail in the following three sections, so we can
examine in depth how we built the RNN network in our build method.


Step 1: defining multilayer RNN cells


The first step in building the RNN network in our build method was to define our
multilayer RNN cells.


Fortunately, TensorFlow has a very nice wrapper class to define LSTM cells, the
BasicLSTMCell class, which can be stacked together to form a multilayer RNN
using the MultiRNNCell wrapper class. The process of stacking RNN cells with
dropout has three nested steps, which can be described from the inside out as
follows:


1. First, create the RNN cells using tf.contrib.rnn.BasicLSTMCell.

2. Apply the dropout to the RNN cells using tf.contrib.rnn.DropoutWrapper.

3. Make a list of such cells according to the desired number of RNN layers and
pass this list to tf.contrib.rnn.MultiRNNCell.


In our build method code, this list is created using a Python list comprehension.
Note that for a single layer, this list has only one cell.


Note


You can read more about these functions at the following links:


• tf.contrib.rnn.BasicLSTMCell: https://www.tensorflow.org/api_docs/python/t

• tf.contrib.rnn.DropoutWrapper:
https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper

• tf.contrib.rnn.MultiRNNCell:
https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/MultiRNNCell


Step 2: defining the initial states for the RNN cells


The second step that our build method takes to build the RNN network is to
define the initial states for the RNN cells.


You'll recall from the architecture of LSTM cells that there are three types of
inputs in an LSTM cell: the input data $x^{(t)}$, the activations of hidden units
from the previous time step $h^{(t-1)}$, and the cell state from the previous time
step $C^{(t-1)}$.


So, in our build method implementation, $x^{(t)}$ is the embedded embed_x data
tensor. However, when we evaluate the cells, we also need to specify the previous
state of the cells. So, when we start processing a new input sequence, we initialize
the cell states to zero state; then after each time step, we need to store the updated
state of the cells to use for the next time step.


Once our multilayer RNN object is defined (cells in our implementation), we define
its initial state in our build method using the cells.zero_state method.


Step 3: creating the RNN using the RNN cells and their states


The third step to creating the RNN in our build method uses the
tf.nn.dynamic_rnn function to pull together all our components.


The tf.nn.dynamic_rnn function takes the embedded data, the RNN cells, and
their initial states, and creates a pipeline for them according to the unrolled
architecture of LSTM cells.


The tf.nn.dynamic_rnn function returns a tuple containing the activations of the
RNN cells, outputs, and their final states, state. The output is a three-dimensional
tensor with this shape: (batch_size, num_steps, lstm_size). We pass the outputs of
the last time step (lstm_outputs[:, -1] in our code) to a fully connected layer to get
logits, and we store the final state to use as the initial state of the next mini-batch
of data.


Note


Feel free to read more about the tf.nn.dynamic_rnn function at its official
documentation page at
https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn.


Finally, in our build method, after setting up the RNN components of the network,
the cost function and optimization schemes can be defined like any other neural
network.


The train method


The next method in our SentimentRNN class is train. This method is quite
similar to the train methods we created in Chapter 14, Going Deeper — The 
Mechanics of TensorFlow and Chapter 15, Classifying Images with Deep 
Convolutional Neural Networks except that we have an additional tensor, state, that 
we feed into our network. 


The following code shows the implementation of the train method: 


    def train(self, X_train, y_train, num_epochs):
        with tf.Session(graph=self.g) as sess:
            sess.run(self.init_op)
            iteration = 1
            for epoch in range(num_epochs):
                # Start each epoch from the zero state
                state = sess.run(self.initial_state)

                for batch_x, batch_y in create_batch_generator(
                        X_train, y_train, self.batch_size):
                    feed = {'tf_x:0': batch_x,
                            'tf_y:0': batch_y,
                            'tf_keepprob:0': 0.5,
                            self.initial_state: state}
                    # The returned final state becomes the
                    # initial state of the next mini-batch
                    loss, _, state = sess.run(
                        ['cost:0', 'train_op',
                         self.final_state],
                        feed_dict=feed)

                    if iteration % 20 == 0:
                        print("Epoch: %d/%d Iteration: %d "
                              "| Train loss: %.5f" % (
                                  epoch + 1, num_epochs,
                                  iteration, loss))

                    iteration += 1
                if (epoch + 1) % 10 == 0:
                    self.saver.save(sess,
                        "model/sentiment-%d.ckpt" % epoch)


In this implementation of our train method, at the beginning of each epoch, we start
from the zero states of RNN cells as our current state. Running each mini-batch of
data is performed by feeding the current state along with the data batch_x and their
labels batch_y. Upon finishing the execution of a mini-batch, we update the state to
be the final state, which is returned by the tf.nn.dynamic_rnn function. This
updated state will be used toward execution of the next mini-batch. This process is
repeated and the current state is updated throughout the epoch.
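

The state-passing pattern described above can be sketched without TensorFlow. In
the following minimal sketch, toy_rnn_step is a hypothetical stand-in for the
sess.run call (it is not part of the SentimentRNN class), and the arithmetic inside
it is arbitrary. The only point is that the state is reset to zero at the start of
each epoch and carried from one mini-batch to the next within an epoch:

```python
def toy_rnn_step(state, batch):
    """Hypothetical stand-in for sess.run: returns (loss, new_state)."""
    new_state = 0.5 * state + sum(batch) / len(batch)
    loss = abs(new_state)
    return loss, new_state


def run_epochs(batches, num_epochs):
    """Record the state after every mini-batch of every epoch."""
    history = []
    for epoch in range(num_epochs):
        state = 0.0  # zero state at the beginning of each epoch
        for batch in batches:
            # The returned state is fed back in for the next mini-batch
            loss, state = toy_rnn_step(state, batch)
            history.append(state)
    return history


history = run_epochs([[1.0, 3.0], [2.0, 2.0]], num_epochs=2)
print(history)
```

Because the state is reset per epoch, the recorded states repeat from one epoch to
the next in this toy setting, whereas within an epoch each mini-batch starts from
the state left behind by the previous one.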


The predict method


Finally, the last method in our SentimentRNN class is the predict method, which
keeps updating the current state similar to the train method, shown in the following
code:


    def predict(self, X_data, return_proba=False):
        preds = []
        with tf.Session(graph=self.g) as sess:
            self.saver.restore(
                sess, tf.train.latest_checkpoint('./model/'))
            test_state = sess.run(self.initial_state)

            for ii, batch_x in enumerate(
                    create_batch_generator(
                        X_data, None, batch_size=self.batch_size), 1):
                # Dropout is disabled at prediction time (keep prob 1.0)
                feed = {'tf_x:0': batch_x,
                        'tf_keepprob:0': 1.0,
                        self.initial_state: test_state}
                if return_proba:
                    pred, test_state = sess.run(
                        ['probabilities:0', self.final_state],
                        feed_dict=feed)
                else:
                    pred, test_state = sess.run(
                        ['labels:0', self.final_state],
                        feed_dict=feed)

                preds.append(pred)

        return np.concatenate(preds)


Instantiating the SentimentRNN class


We've now coded and examined all four parts of our SentimentRNN class, which
were the class constructor, the build method, the train method, and the predict
method.


We are now ready to create an object of the class SentimentRNN, with parameters as
follows:


>>> n_words = max(list(word_to_int.values())) + 1
>>>
>>> rnn = SentimentRNN(n_words=n_words,
...                    seq_len=sequence_length,
...                    embed_size=256,
...                    lstm_size=128,
...                    num_layers=1,
...                    batch_size=100,
...                    learning_rate=0.001)


Notice here that we use num_layers=1 to use a single RNN layer, although our
implementation allows us to create multilayer RNNs by setting num_layers greater
than 1. Here we should consider the small size of our dataset, and that a single RNN
layer may generalize better to unseen data, since it is less likely to overfit the training
data.


Training and optimizing the sentiment analysis RNN model


Next, we can train the RNN model by calling the rnn.train function. In the
following code, we train the model for 40 epochs using the input from X_train and
the corresponding class labels stored in y_train:


>>> rnn.train(X_train, y_train, num_epochs=40)
Epoch: 1/40 Iteration: 20 | Train loss: 0.70637
Epoch: 1/40 Iteration: 40 | Train loss: 0.60539
Epoch: 1/40 Iteration: 60 | Train loss: 0.66977
Epoch: 1/40 Iteration: 80 | Train loss: 0.51997


The trained model is saved using TensorFlow's checkpointing system, which we 
discussed in Chapter 14, Going Deeper — The Mechanics of TensorFlow. Now, we 
can use the trained model for predicting the class labels on the test set, as follows: 


>>> preds = rnn.predict(X_test)
>>> y_true = y_test[:len(preds)]
>>> print('Test Acc.: %.3f' % (
...       np.sum(preds == y_true) / len(y_true)))


The result will show an accuracy of 86 percent. Given the small size of this dataset, 
this is comparable to the test prediction accuracy obtained in Chapter 8, Applying 
Machine Learning to Sentiment Analysis. 


We can optimize this further by changing the hyperparameters of the model, such as
lstm_size, seq_len, and embed_size, to achieve better generalization performance.
However, for hyperparameter tuning, it is recommended that we create a separate
validation set and that we don't repeatedly use the test set for evaluation to avoid
introducing bias through test data leakage, which we discussed in Chapter 6,
Learning Best Practices for Model Evaluation and Hyperparameter Tuning.
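

As a rough illustration of this advice, the following sketch holds out part of the
training data as a validation set. The function name train_valid_split, the 20
percent fraction, the seed, and the toy data are assumptions made for this example;
they are not part of the chapter's code:

```python
import random


def train_valid_split(samples, labels, valid_fraction=0.2, seed=1):
    """Shuffle the indices once and hold out a fraction for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_valid = int(len(indices) * valid_fraction)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    train = ([samples[i] for i in train_idx],
             [labels[i] for i in train_idx])
    valid = ([samples[i] for i in valid_idx],
             [labels[i] for i in valid_idx])
    return train, valid


# Toy data: 10 placeholder "reviews" with alternating labels
X = ['review %d' % i for i in range(10)]
y = [i % 2 for i in range(10)]
(X_tr, y_tr), (X_val, y_val) = train_valid_split(X, y)
print(len(X_tr), len(X_val))
```

Hyperparameters would then be compared on the validation portion only, and the
test set would be used once, for the final estimate.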


Also, if you're interested in the prediction probabilities on the test set rather than the
class labels, then you can set return_proba=True as follows:


>>> proba = rnn.predict(X_test, return_proba=True)


So this was our first RNN model for sentiment analysis. We'll now go further and
create an RNN for character-by-character language modeling in TensorFlow, as
another popular application of sequence modeling.


Project two: implementing an RNN for character-level language modeling in
TensorFlow


Language modeling is a fascinating application that enables machines to perform
human-language-related tasks, such as generating English sentences. One of the
interesting efforts in this area is the work done by Sutskever, Martens, and Hinton
(Generating Text with Recurrent Neural Networks, Ilya Sutskever, James Martens,
and Geoffrey E. Hinton, Proceedings of the 28th International Conference on
Machine Learning (ICML-11), 2011,
https://pdfs.semanticscholar.org/93c2/0e38c85b69fc2d2eb314b3c1217913f7db11.pdf).


In the model that we'll build now, the input is a text document, and our goal is to
develop a model that can generate new text similar to the input document. Examples
of such an input can be a book or a computer program in a specific programming
language.


In character-level language modeling, the input is broken down into a sequence of
characters that are fed into our network one character at a time. The network will
process each new character in conjunction with the memory of the previously seen
characters to predict the next character. The following figure shows an example of
character-level language modeling:


[Figure: the string "Hello world!" is broken into a sequence of characters; at each
step, the input sequence observed so far is used to predict the next character.]


We can break this implementation down into three separate steps: preparing the
data, building the RNN model, and performing next-character prediction and
sampling to generate new text.
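

To make the next-character prediction setup concrete, the following small sketch
pairs every character of the example string "Hello world!" with the character that
follows it. It only illustrates how inputs and targets relate; it is not part of the
model code developed in this chapter:

```python
text = "Hello world!"

# Each input character is paired with the character that follows it
inputs = list(text[:-1])
targets = list(text[1:])
pairs = list(zip(inputs, targets))

for observed, expected in pairs[:4]:
    print(repr(observed), '->', repr(expected))
```

The last character of the text has no successor, so it appears only as a target and
never as an input.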


If you recall from the previous sections of this chapter, we mentioned the exploding 
gradient problem. In this application, we'll also get a chance to play with a gradient 
clipping technique to avoid this exploding gradient problem. 


Preparing the data


In this section, we prepare the data for character-level language modeling.


To get the input data, visit the Project Gutenberg website at
https://www.gutenberg.org/, which provides thousands of free e-books. For our
example, we can get the book The Tragedie of Hamlet by William Shakespeare in
plain text format from http://www.gutenberg.org/cache/epub/2265/pg2265.txt.


Note that this link will directly take you to the download page. If you are using 
macOS or a Linux operating system, you can download the file with the following 
command in the Terminal: 


curl http://www.gutenberg.org/cache/epub/2265/pg2265.txt > pg2265.txt 


If this resource becomes unavailable in future, a copy of this text is also included in 
this chapter's code directory in the book's code repository at 


https://github.com/rasbt/python-machine-learning-book-2nd-edition. 


Once we have some data, we can read it into a Python session as plain text. In the
following code, the Python variable chars represents the set of unique characters
observed in this text. We then create a dictionary that maps each character to an
integer, char2int, and a dictionary that performs the reverse mapping from integers
back to those unique characters, int2char. Using the char2int dictionary, we
convert the text into a NumPy array of integers. The following figure shows an
example of converting characters into integers and the reverse for the words
"Hello" and "world":


[Figure: mapping characters to integers with the char2int dictionary, and mapping
integers back to characters with the int2char dictionary.]


This code reads the text from the downloaded file, removes the beginning portion of
the text that contains some legal description of the Gutenberg project, and then
constructs the dictionaries based on the text:


>>> import numpy as np
>>> ## Reading and processing text
>>> with open('pg2265.txt', 'r', encoding='utf-8') as f:
...     text = f.read()
>>> ## Skip the legal preamble at the beginning of the file
>>> text = text[15858:]
>>> chars = set(text)
>>> char2int = {ch: i for i, ch in enumerate(chars)}
>>> int2char = dict(enumerate(chars))
>>> text_ints = np.array([char2int[ch] for ch in text],
...                      dtype=np.int32)
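

The mapping logic above can be checked on a tiny string. The following sketch uses
plain Python lists instead of a NumPy array and, as an assumption made only to keep
the output reproducible, sorts the character set before enumerating it (the code
above enumerates the set directly, so the integer assigned to each character can
differ between runs):

```python
text = "Hello world"

# Sorting is an addition for reproducibility; a raw set has no fixed order
chars = sorted(set(text))
char2int = {ch: i for i, ch in enumerate(chars)}
int2char = dict(enumerate(chars))

# Encode the text as integers and decode it again
text_ints = [char2int[ch] for ch in text]
decoded = ''.join(int2char[i] for i in text_ints)

print(text_ints)
print(decoded)
```

Decoding the integer sequence with int2char recovers the original text, which
confirms that the two dictionaries are inverses of each other.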


Now, we should reshape the data into batches of sequences, the most important step 
in preparing data. As we know, the goal is to predict the next character based on the 
sequence of characters that we have observed so far. Therefore, we shift the input (x) 
and output (y) of the neural network by one character. The following figure shows 
the preprocessing steps, starting from a text corpus to generating data arrays for x 
and y: 


[Figure: the text corpus is converted into a long sequence of integers; sequences x
and y are created from it, with y shifted by one character relative to x; both are
then reshaped into the training data arrays x and y, each with batch size rows and
(number of batches × number of steps) columns.]


As you can see in this figure, the training arrays x and y have the same shapes or
dimensions, where the number of rows is equal to the batch size and the number of
columns is number of batches × number of steps.


Given the input array data that contains the integers that correspond to the characters
in the text corpus, the following function will generate x and y with the same
structure shown in the previous figure:


>>> def reshape_data(sequence, batch_size, num_steps):
...     tot_batch_length = batch_size * num_steps
...     num_batches = int(len(sequence) / tot_batch_length)
...     if num_batches*tot_batch_length + 1 > len(sequence):
...         num_batches = num_batches - 1
...     ## Truncate the sequence at the end to get rid of
...     ## remaining characters that do not make a full batch
...     x = sequence[0 : num_batches*tot_batch_length]
...     ## y is x shifted by one character
...     y = sequence[1 : num_batches*tot_batch_length + 1]
...     ## Split x & y into a list of batches of sequences:
...     x_batch_splits = np.split(x, batch_size)
...     y_batch_splits = np.split(y, batch_size)
...     ## Stack the batches together
...     ## batch_size x tot_batch_length
...     x = np.stack(x_batch_splits)
...     y = np.stack(y_batch_splits)
...
...     return x, y
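

To see what this reshaping produces, here is a dependency-free sketch that mirrors
the same steps with plain Python lists on a toy sequence. The name
reshape_data_lists and the toy input are assumptions for illustration only; the
chapter's own function operates on NumPy arrays:

```python
def reshape_data_lists(sequence, batch_size, num_steps):
    """List-based sketch of the reshaping: returns rows for x and y."""
    tot_batch_length = batch_size * num_steps
    num_batches = len(sequence) // tot_batch_length
    # y needs one extra character at the end, so drop a batch if needed
    if num_batches * tot_batch_length + 1 > len(sequence):
        num_batches -= 1
    n = num_batches * tot_batch_length
    x, y = sequence[0:n], sequence[1:n + 1]
    # Cut both sequences into batch_size rows of equal length
    row_len = n // batch_size
    x_rows = [x[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    y_rows = [y[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    return x_rows, y_rows


seq = list(range(13))
x, y = reshape_data_lists(seq, batch_size=2, num_steps=3)
for x_row, y_row in zip(x, y):
    print(x_row, '->', y_row)
```

Each row of y contains the same values as the corresponding row of x shifted by one
position, and the number of columns equals the number of batches times the number
of steps.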


The next step is to split the arrays x and y into mini-batches where each row is a
sequence with length equal to the number of steps. The process of splitting the data
array x is shown in the following figure:


[Figure: the training data array x, with batch size rows, is split column-wise into
mini-batches (Batch 1, Batch 2, ..., Batch n), each with number of steps columns.]


In the following code, we define a function named create_batch_generator that
splits the data arrays x and y, as shown in the previous figure, and outputs a batch
generator. Later, we will use this generator to iterate through the mini-batches during
the training of our network:


>>> def create_batch_generator(data_x, data_y, num_steps):
...     batch_size, tot_batch_length = data_x.shape
...     num_batches = int(tot_batch_length/num_steps)
...     for b in range(num_batches):
...         ## Slice num_steps columns from every row
...         yield (data_x[:, b*num_steps:(b+1)*num_steps],
...                data_y[:, b*num_steps:(b+1)*num_steps])
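

The column-wise slicing performed by the generator can be tried out on small nested
lists. The following sketch is a list-based analogue written for illustration; the
name create_batch_generator_lists and the toy arrays are assumptions and do not
appear in the chapter's code:

```python
def create_batch_generator_lists(data_x, data_y, num_steps):
    """Yield (x, y) mini-batches, each num_steps columns wide."""
    tot_batch_length = len(data_x[0])
    num_batches = tot_batch_length // num_steps
    for b in range(num_batches):
        lo, hi = b * num_steps, (b + 1) * num_steps
        yield ([row[lo:hi] for row in data_x],
               [row[lo:hi] for row in data_y])


# Two rows (batch size 2) and six columns (2 batches x 3 steps)
data_x = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
data_y = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]

batches = list(create_batch_generator_lists(data_x, data_y, num_steps=3))
for batch_x, batch_y in batches:
    print(batch_x, batch_y)
```

Consecutive mini-batches continue each row where the previous mini-batch stopped,
which is what allows the final RNN state of one mini-batch to serve as the initial
state of the next.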


At this point, we've now completed the data preprocessing steps, and we have the 
data in the proper format. In the next section, we'll implement the RNN model for 
character-level language modeling. 


Building a character-level RNN model


To build a character-level neural network, we'll implement a class called CharRNN
that constructs the graph of the RNN in order to predict the next character, after
observing a given sequence of characters. From the classification perspective, the
number of classes is the total number of unique characters that exist in the text
corpus. The CharRNN class has four methods, as follows:


• A constructor that sets up the learning parameters, creates a computation graph,
and calls the build method to construct the graph based on the sampling mode
versus the training mode.

• A build method that defines the placeholders for feeding the data, constructs
the RNN using LSTM cells, and defines the output of the network, the cost
function, and the optimizer.

• A train method to iterate through the mini-batches and train the network for
the specified number of epochs.

• A sample method to start from a given string, calculate the probabilities for the
next character, and choose a character randomly according to these
probabilities. This process will be repeated, and the sampled characters will be
concatenated together to form a string. Once the size of this string reaches the
specified length, it will return the string.


We'll break these four methods into separate code sections and explain each one.
Note that implementing the RNN part of this model is very similar to the
implementation in the Project one: performing sentiment analysis of IMDb movie
reviews using multilayer RNNs section. So, we'll skip the description of building the
RNN components here.


The constructor


In contrast to our previous implementation for sentiment analysis, where the same 
computation graph was used for both training and prediction modes, this time our 
computation graph is going to be different for the training versus the sampling mode. 


Therefore we need to add a new Boolean type argument to the constructor, to 
determine whether we're building the model for the training mode or the sampling 
mode. The following code shows the implementation of the constructor enclosed in 
the class definition: 


import tensorflow as tf
import os


class CharRNN(object):
    def __init__(self, num_classes, batch_size=64,
                 num_steps=100, lstm_size=128,
                 num_layers=1, learning_rate=0.001,
                 keep_prob=0.5, grad_clip=5,
                 sampling=False):
        self.num_classes = num_classes
        self.batch_size = batch_size
        self.num_steps = num_steps
        self.lstm_size = lstm_size
        self.num_layers = num_layers
        self.learning_rate = learning_rate
        self.keep_prob = keep_prob
        self.grad_clip = grad_clip

        self.g = tf.Graph()
        with self.g.as_default():
            tf.set_random_seed(123)

            # Build the graph for training or for sampling
            self.build(sampling=sampling)
            self.saver = tf.train.Saver()
            self.init_op = tf.global_variables_initializer()


As we planned earlier, the Boolean sampling argument is used to determine whether
the instance of CharRNN is for building the graph in the training mode
(sampling=False) or the sampling mode (sampling=True).


In addition to the sampling argument, we've introduced a new argument called
grad_clip, which is used for clipping the gradients to avoid the exploding gradient
problem that we mentioned earlier.


Then, similar to the previous implementation, the constructor creates a computation
graph, sets the graph-level random seed for consistent output, and builds the graph
by calling the build method.


The build method


The next method of the CharRNN class is build, which is very similar to the build
method in the Project one: performing sentiment analysis of IMDb movie reviews
using multilayer RNNs section, except for some minor differences. The build
method first defines two local variables, batch_size and num_steps, based on the
mode, as follows:


in sampling mode:  batch_size = 1,  num_steps = 1

in training mode:  batch_size = self.batch_size,  num_steps = self.num_steps

Recall that in the sentiment analysis implementation, we used an embedding layer to
create a salient representation for the unique words in the dataset. In contrast, here
we are using the one-hot encoding scheme for both x and y with
depth=num_classes, where num_classes is in fact the total number of unique
characters in the text corpus.
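

As a small illustration of the one-hot scheme, the following sketch encodes a short
list of character IDs by hand. The helper name one_hot and the toy values are
assumptions for this example; in the build method itself the encoding is done by
TensorFlow:

```python
def one_hot(indices, depth):
    """Return one row per index, with a single 1.0 at that index."""
    return [[1.0 if j == idx else 0.0 for j in range(depth)]
            for idx in indices]


num_classes = 4  # toy vocabulary of four unique characters
encoded = one_hot([2, 0, 3], depth=num_classes)
for row in encoded:
    print(row)
```

Every row has exactly as many entries as there are classes, so the width of the
encoded input grows with the number of unique characters in the corpus.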


Building a multilayer RNN component of the model is exactly the same as in our
sentiment analysis implementation, using the tf.nn.dynamic_rnn function.
However, outputs from the tf.nn.dynamic_rnn function is a three-dimensional
tensor with this shape: (batch_size, num_steps, lstm_size). Next, this tensor will
be reshaped into a two-dimensional tensor with the shape
(batch_size*num_steps, lstm_size), which is passed to the tf.layers.dense function
to make a fully connected layer and obtain logits (net inputs). Finally, the
probabilities for the next batch of characters are obtained and the cost function is
defined. In addition, here, we apply gradient clipping using the
tf.clip_by_global_norm function to avoid the exploding gradient problem.
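

The idea behind clipping by global norm can be sketched in a few lines of plain
Python. This is an illustrative reimplementation under the assumption that each
gradient is a flat list of floats; it is not TensorFlow's code, and the helper
below merely shares its name with the TensorFlow function:

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients jointly if their global norm is too large."""
    # Global norm: square root of the sum of squares over all gradients
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale is 1.0 when the global norm is already within the limit
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
print(norm)
print(clipped)
```

Because every gradient is multiplied by the same factor, the direction of the
overall update is preserved and only its length is limited.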


The following code shows the implementation of what we've just described for our
new build method:


    def build(self, sampling):
        if sampling == True:
            batch_size, num_steps = 1, 1
        else:
            batch_size = self.batch_size
            num_steps = self.num_steps

        tf_x = tf.placeholder(tf.int32,
                              shape=[batch_size, num_steps],
                              name='tf_x')
        tf_y = tf.placeholder(tf.int32,
                              shape=[batch_size, num_steps],
                              name='tf_y')
        tf_keepprob = tf.placeholder(tf.float32,
                                     name='tf_keepprob')

        # One-hot encoding:
        x_onehot = tf.one_hot(tf_x, depth=self.num_classes)
        y_onehot = tf.one_hot(tf_y, depth=self.num_classes)

        ### Build the multi-layer RNN cells
        cells = tf.contrib.rnn.MultiRNNCell(
            [tf.contrib.rnn.DropoutWrapper(
                tf.contrib.rnn.BasicLSTMCell(self.lstm_size),
                output_keep_prob=tf_keepprob)
             for _ in range(self.num_layers)])

        ## Define the initial state
        self.initial_state = cells.zero_state(
            batch_size, tf.float32)

        ## Run each sequence step through the RNN
        lstm_outputs, self.final_state = tf.nn.dynamic_rnn(
            cells, x_onehot,
            initial_state=self.initial_state)

        print('  << lstm_outputs  >>', lstm_outputs)

        seq_output_reshaped = tf.reshape(
            lstm_outputs,
            shape=[-1, self.lstm_size],
            name='seq_output_reshaped')

        logits = tf.layers.dense(
            inputs=seq_output_reshaped,
            units=self.num_classes,
            activation=None,
            name='logits')

        proba = tf.nn.softmax(
            logits,
            name='probabilities')

        y_reshaped = tf.reshape(
            y_onehot,
            shape=[-1, self.num_classes],
            name='y_reshaped')
        cost = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits=logits,
                labels=y_reshaped),
            name='cost')

        # Gradient clipping to avoid "exploding gradients"
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(
            tf.gradients(cost, tvars),
            self.grad_clip)
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        train_op = optimizer.apply_gradients(
            zip(grads, tvars),
            name='train_op')


The train method


The next method of the CharRNN class is the train method, which is very similar to
the train method described in the Project one: performing sentiment analysis of
IMDb movie reviews using multilayer RNNs section. Here is the train method code,
which will look very familiar from the sentiment analysis version we built earlier in
this chapter:


    def train(self, train_x, train_y,
              num_epochs, ckpt_dir='./model/'):
        ## Create the checkpoint directory
        ## if it does not exist
        if not os.path.exists(ckpt_dir):
            os.mkdir(ckpt_dir)

        with tf.Session(graph=self.g) as sess:
            sess.run(self.init_op)

            n_batches = int(train_x.shape[1]/self.num_steps)
            iterations = n_batches * num_epochs
            for epoch in range(num_epochs):

                # Train network
                new_state = sess.run(self.initial_state)
                loss = 0
                ## Mini-batch generator:
                bgen = create_batch_generator(
                    train_x, train_y, self.num_steps)
                for b, (batch_x, batch_y) in enumerate(bgen, 1):
                    iteration = epoch*n_batches + b

                    feed = {'tf_x:0': batch_x,
                            'tf_y:0': batch_y,
                            'tf_keepprob:0': self.keep_prob,
                            self.initial_state: new_state}
                    batch_cost, _, new_state = sess.run(
                        ['cost:0', 'train_op',
                         self.final_state],
                        feed_dict=feed)
                    if iteration % 10 == 0:
                        print('Epoch %d/%d Iteration %d'
                              '| Training loss: %.4f' % (
                                  epoch + 1, num_epochs,
                                  iteration, batch_cost))

                ## Save the trained model
                self.saver.save(
                    sess, os.path.join(
                        ckpt_dir, 'language_modeling.ckpt'))


The sample method


The final method in our CharRNN class is the sample method. The behavior of this
sample method is similar to that of the predict method that we implemented in the
Project one: performing sentiment analysis of IMDb movie reviews using multilayer
RNNs section. However, the difference here is that we calculate the probabilities for
the next character from an observed sequence, observed_seq. Then, these
probabilities are passed to a function named get_top_char, which randomly selects
one character according to the obtained probabilities.


Initially, the observed sequence starts from starter_seq, which is provided as an
argument. When new characters are sampled according to their predicted
probabilities, they are appended to the observed sequence, and the new observed
sequence is used for predicting the next character.
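

The control flow of this sampling procedure can be sketched independently of
TensorFlow. In the sketch below, generate and toy_next_char are hypothetical names
introduced for illustration: toy_next_char stands in for running the network on one
character and always proposes the following letter of the alphabet, whereas the
real model samples from predicted probabilities. The starter sequence is assumed to
be non-empty:

```python
def toy_next_char(ch, state):
    """Hypothetical model: proposes the letter after ch, state unchanged."""
    return chr(ord(ch) + 1), state


def generate(next_char_fn, starter_seq, output_length):
    observed_seq = list(starter_seq)
    state = None
    # 1: run the model over the starter sequence to warm up the state
    for ch in starter_seq:
        proposal, state = next_char_fn(ch, state)
    observed_seq.append(proposal)
    # 2: feed the most recent character back in, one step at a time
    for _ in range(output_length):
        proposal, state = next_char_fn(observed_seq[-1], state)
        observed_seq.append(proposal)
    return ''.join(observed_seq)


print(generate(toy_next_char, starter_seq='ab', output_length=3))
```

Note that one character is appended after the warm-up phase and output_length more
are appended in the loop, so the returned string is longer than output_length.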


The implementation of the sample method is as follows:


    def sample(self, output_length,
               ckpt_dir, starter_seq="The "):
        observed_seq = [ch for ch in starter_seq]
        with tf.Session(graph=self.g) as sess:
            self.saver.restore(
                sess,
                tf.train.latest_checkpoint(ckpt_dir))
            ## 1: run the model using the starter sequence
            new_state = sess.run(self.initial_state)
            for ch in starter_seq:
                x = np.zeros((1, 1))
                x[0, 0] = char2int[ch]
                feed = {'tf_x:0': x,
                        'tf_keepprob:0': 1.0,
                        self.initial_state: new_state}
                proba, new_state = sess.run(
                    ['probabilities:0', self.final_state],
                    feed_dict=feed)

            ch_id = get_top_char(proba, len(chars))
            observed_seq.append(int2char[ch_id])

            ## 2: run the model using the updated observed_seq
            for i in range(output_length):
                x[0, 0] = ch_id
                feed = {'tf_x:0': x,
                        'tf_keepprob:0': 1.0,
                        self.initial_state: new_state}
                proba, new_state = sess.run(
                    ['probabilities:0', self.final_state],
                    feed_dict=feed)

                ch_id = get_top_char(proba, len(chars))
                observed_seq.append(int2char[ch_id])

        return ''.join(observed_seq)


So here, the sample method calls the get_top_char function to choose a character
ID randomly (ch_id) according to the obtained probabilities.


In this get_top_char function, the probabilities are first sorted; then all but the
top_n largest probabilities are set to zero, the remaining ones are renormalized, and
the result is passed to the numpy.random.choice function to randomly select one
character ID out of these top probabilities. The implementation of the get_top_char
function is as follows:
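def get_top_char(probas, char_size, top_n=5):
    p = np.squeeze(probas)
    # Zero out everything except the top_n largest probabilities
    p[np.argsort(p)[:-top_n]] = 0.0
    # Renormalize so that the remaining probabilities sum to one
    p = p / np.sum(p)
    ch_id = np.random.choice(char_size, 1, p=p)[0]
    return ch_id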


Note, of course, that this function should be defined before the definition of the
CharRNN class; we've explained it in this order here so that we can explain the
concepts in order. Browse through the code notebook that accompanies this chapter
to get a better overview of the order in which the functions are defined.
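

The effect of restricting the choice to the most probable characters can be
examined with a standard-library sketch. The helper names top_n_renormalize and
sample_top_char, the toy probabilities, and the use of random.choices in place of
numpy.random.choice are all assumptions made to keep this example free of
dependencies:

```python
import random


def top_n_renormalize(probas, top_n=5):
    """Zero out all but the top_n largest entries, then renormalize."""
    order = sorted(range(len(probas)), key=lambda i: probas[i])
    keep = set(order[-top_n:])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probas)]
    total = sum(kept)
    return [p / total for p in kept]


def sample_top_char(probas, top_n=5, rng=random):
    """Draw one character ID according to the renormalized probabilities."""
    p = top_n_renormalize(probas, top_n)
    return rng.choices(range(len(p)), weights=p, k=1)[0]


probas = [0.1, 0.4, 0.05, 0.3, 0.15]
p_top2 = top_n_renormalize(probas, top_n=2)
rng = random.Random(123)
draws = [sample_top_char(probas, top_n=2, rng=rng) for _ in range(100)]
print(p_top2)
print(sorted(set(draws)))
```

With top_n=2, only the two most probable character IDs can ever be drawn, which
keeps unlikely characters out of the generated text at the cost of less variety.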


Creating and training the CharRNN Model


Now we're ready to create an instance of the CharRNN class to build the RNN model,
and to train it with the following configurations:


>>> batch_size = 64
>>> num_steps = 100
>>> train_x, train_y = reshape_data(text_ints,
...                                 batch_size,
...                                 num_steps)
>>>
>>> rnn = CharRNN(num_classes=len(chars), batch_size=batch_size)
>>> rnn.train(train_x, train_y,
...           num_epochs=100,
...           ckpt_dir='./model-100/')


The trained model will be saved in a directory called ./model-100/ so that we can
reload it later for prediction or for continuing the training.


The CharRNN model in the sampling mode


Next up, we can create a new instance of the CharRNN class in the sampling mode by
specifying that sampling=True. We'll call the sample method to load the saved
model in the ./model-100/ folder, and generate a sequence of 500 characters:


>>> del rnn
>>>
>>> np.random.seed(123)
>>> rnn = CharRNN(len(chars), sampling=True)
>>> print(rnn.sample(ckpt_dir='./model-100/',
...                  output_length=500))


The generated text will look like the following: 


The stall soues tay and the hates, 
The perse in there is that so the meanes this made there 


Ham. Ile teath thes are this makere of a driane, 
Why shis mestend the Casst of is singe, 
In this to this, to mers it is for marth, 


Ase hinees sim thig tald ow a tore andere, 

In histhene tistere shere this wile and my Lord: 

And tit mighes the secleer allost heruen, and that hash to sall and hears, 
If you his moses tonger and mout ofr mesting a forte tis at 


Pomin. Where in you dist and sintere shan shall 





You can see in the resulting output that some English words are mostly
preserved. It's also important to note that this is from an old English text; therefore,
some words in the original text may be unfamiliar. To get a better result, we would
need to train the model for a higher number of epochs. Feel free to repeat this with a
much larger document and train the model for more epochs.
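
The sampling loop described in this section can be sketched in a few lines of plain Python: the generator repeatedly asks a model for next-character probabilities, draws one character, and appends it to the text. Everything here is an assumption for illustration. toy_next_char_probs is a stand-in for a trained network, the names sample_top_char and generate are hypothetical, and restricting the draw to the top_n most probable characters is one common choice rather than a description of what CharRNN.sample does internally.

```python
import random


def sample_top_char(probs, chars, top_n=5, rng=random):
    """Hypothetical helper: draw one character from the top_n most probable ones."""
    # Indices of the top_n largest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_n]
    weights = [probs[i] for i in top]
    # random.choices treats the weights as relative, so no renormalization is needed.
    idx = rng.choices(top, weights=weights, k=1)[0]
    return chars[idx]


def toy_next_char_probs(prev_char, chars):
    """Stand-in for a trained model: favors the character that follows prev_char."""
    n = len(chars)
    probs = [0.1 / (n - 1)] * n
    probs[(chars.index(prev_char) + 1) % n] = 0.9
    return probs


def generate(start, chars, output_length, top_n=5, seed=123):
    """Grow the text one sampled character at a time until it has output_length characters."""
    rng = random.Random(seed)  # local generator, so results are repeatable
    text = start
    for _ in range(output_length - len(start)):
        probs = toy_next_char_probs(text[-1], chars)
        text += sample_top_char(probs, chars, top_n=top_n, rng=rng)
    return text


vocab = list("abcde")
sample_text = generate("a", vocab, output_length=20)
print(sample_text)
```

Setting top_n=1 turns the draw into a greedy argmax, which tends to produce repetitive text. Larger values trade some coherence for variety.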
Chapter and book summary


We hope you enjoyed this last chapter of Python Machine Learning and our exciting 
tour of machine learning and deep learning. Through the journey of this book, we've 
covered the essential topics that this field has to offer, and you should now be well 
equipped to put those techniques into action to solve real-world problems. 


We started our journey with a brief overview of the different types of learning tasks: 
supervised learning, reinforcement learning, and unsupervised learning. We then 
discussed several different learning algorithms that you can use for classification, 
starting with simple single-layer neural networks in Chapter 2, Training Simple 
Machine Learning Algorithms for Classification. 


We continued to discuss advanced classification algorithms in Chapter 3, A Tour of
Machine Learning Classifiers Using scikit-learn, and we learned about the most 
important aspects of a machine learning pipeline in Chapter 4, Building Good 
Training Sets — Data Preprocessing and Chapter 5, Compressing Data via 
Dimensionality Reduction. 


Remember that even the most advanced algorithm is limited by the information in 
the training data that it gets to learn from. So in Chapter 6, Learning Best Practices 
for Model Evaluation and Hyperparameter Tuning, we learned about the best 
practices to build and evaluate predictive models, which is another important aspect
of machine learning applications.


If one single learning algorithm does not achieve the performance we desire, it can
sometimes be helpful to create an ensemble of experts to make a prediction. We
explored this in Chapter 7, Combining Different Models for Ensemble Learning. 


Then in Chapter 8, Applying Machine Learning to Sentiment Analysis, we applied 
machine learning to analyze one of the most popular and interesting forms of data in 
the modern age that's dominated by social media platforms on the internet—text 
documents. 


Next, we reminded ourselves that machine learning techniques are not limited to 
offline data analysis, and in Chapter 9, Embedding a Machine Learning Model into a 
Web Application, we saw how to embed a machine learning model into a web 
application to share it with the outside world. 
For the most part, our focus was on algorithms for classification, which is probably
the most popular application of machine learning. However, this is not where our
journey ended! In Chapter 10, Predicting Continuous Target Variables with
Regression Analysis, we explored several algorithms for regression analysis to
predict continuous-valued outputs.


Another exciting subfield of machine learning is clustering analysis, which can help 
us find hidden structures in the data, even if our training data does not come with the
right answers to learn from. We worked with this in Chapter 11, Working with 
Unlabeled Data — Clustering Analysis. 


We then shifted our attention to one of the most exciting algorithms in the
whole machine learning field—artificial neural networks. We started by 
implementing a multilayer perceptron from scratch with NumPy in Chapter 12, 
Implementing a Multilayer Artificial Neural Network from Scratch. 


The power of TensorFlow became obvious in Chapter 13, Parallelizing Neural 
Network Training with TensorFlow, where we used TensorFlow to facilitate the 
process of building neural network models and make use of GPUs to make the 
training of multilayer neural networks more efficient. 


We delved deeper into the mechanics of TensorFlow in Chapter 14, Going Deeper — 
The Mechanics of TensorFlow, and discussed the different aspects and mechanics of 
TensorFlow, including variables and operators in a TensorFlow computation graph, 
variable scopes, launching graphs, and different ways of executing nodes. 


In Chapter 15, Classifying Images with Deep Convolutional Neural Networks, we 
dived into convolutional neural networks, which are widely used in computer vision 
at the moment, due to their great performance in image classification tasks.


Finally, here in Chapter 16, Modeling Sequential Data Using Recurrent Neural 
Networks, we learned about sequence modeling using RNNs. While a comprehensive 
study of deep learning is well beyond the scope of this book, we hope that we've
kindled your interest enough to follow the most recent advancements in this field of 
deep learning. 


If you're considering a career in machine learning, or you just want to keep up to 
date with the current advancements in this field, I can recommend to you the works
of the following leading experts in the machine learning field: 
Geoffrey Hinton (http://www.cs.toronto.edu/~hinton/)

Andrew Ng (http://www.andrewng.org/) 

Yann LeCun (http://yann.lecun.com) 

Juergen Schmidhuber (http://people.idsia.ch/~juergen/) 

Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy/yoshua_en/)


Just to name a few! 


And of course, don't hesitate to join the scikit-learn, TensorFlow, and Keras mailing 
lists to participate in interesting discussions around these libraries and machine 
learning in general. Lastly, you can find out what we, the authors, are up to at
http://sebastianraschka.com and http://vahidmirjalili.com. You're always welcome to 
contact us if you have any questions about this book, or need some general tips about 
machine learning. 